
[v7,7/7] docs: Document Syscall User Dispatch

Message ID 20201118032840.3429268-8-krisman@collabora.com (mailing list archive)
State New

Commit Message

Gabriel Krisman Bertazi Nov. 18, 2020, 3:28 a.m. UTC
Explain the interface, provide some background and security notes.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 .../admin-guide/syscall-user-dispatch.rst     | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst

Comments

Florian Weimer Nov. 18, 2020, 8:48 a.m. UTC | #1
* Gabriel Krisman Bertazi:

> +Interface
> +---------
> +
> +A process can setup this mechanism on supported kernels
> +CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
> +
> +  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
> +
> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
> +disable the mechanism globally for that thread.  When
> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
> +
> +<offset> and <offset+length> delimit a closed memory region interval
> +from which syscalls are always executed directly, regardless of the
> +userspace selector.  This provides a fast path for the C library, which
> +includes the most common syscall dispatchers in the native code
> +applications, and also provides a way for the signal handler to return
> +without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
> +interface should make sure that at least the signal trampoline code is
> +included in this region. In addition, for syscalls that implement the
> +trampoline code on the vDSO, that trampoline is never intercepted.
> +
> +[selector] is a pointer to a char-sized region in the process memory
> +region, that provides a quick way to enable disable syscall redirection
> +thread-wide, without the need to invoke the kernel directly.  selector
> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
> +value should terminate the program with a SIGSYS.

Is this a process property or a task/thread property?  The last
paragraph says “thread-wide”, but the first paragraph says “process”.

Gabriel Krisman Bertazi Nov. 18, 2020, 5:02 p.m. UTC | #2
Florian Weimer <fw@deneb.enyo.de> writes:

> * Gabriel Krisman Bertazi:
>
>> +Interface
>> +---------
>> +
>> +A process can setup this mechanism on supported kernels
>> +CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
>> +
>> +  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
>> +
>> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
>> +disable the mechanism globally for that thread.  When
>> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
>> +
>> +<offset> and <offset+length> delimit a closed memory region interval
>> +from which syscalls are always executed directly, regardless of the
>> +userspace selector.  This provides a fast path for the C library, which
>> +includes the most common syscall dispatchers in the native code
>> +applications, and also provides a way for the signal handler to return
>> +without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
>> +interface should make sure that at least the signal trampoline code is
>> +included in this region. In addition, for syscalls that implement the
>> +trampoline code on the vDSO, that trampoline is never intercepted.
>> +
>> +[selector] is a pointer to a char-sized region in the process memory
>> +region, that provides a quick way to enable disable syscall redirection
>> +thread-wide, without the need to invoke the kernel directly.  selector
>> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
>> +value should terminate the program with a SIGSYS.
>
> Is this a process property or a task/thread property?  The last
> paragraph says “thread-wide”, but the first paragraph says “process”.

It is per-thread, as it doesn't survive across clone/fork syscalls.  I
will fix the first paragraph of this text.

Patch

diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..e2fb36926f97
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process.  Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters.  Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace.  The application is in control of a flip
+switch, indicating the current personality of the process.  A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crossings, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes.  Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
+doesn't rely on any of the syscall ABI to do the filtering.  It uses
+only the syscall dispatcher address and the userspace key.
+
+Interface
+---------
+
+A process can set up this mechanism on supported kernels
+(CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread.  When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+<offset> and <offset+length> delimit a closed memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector.  This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+that provides a quick way to enable and disable syscall redirection
+thread-wide, without the need to invoke the kernel directly.  selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process.  It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value.  If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
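
For illustration, the interface documented above could be exercised from C
roughly as follows.  This is a minimal sketch and not part of the patch: the
helper names (on_sigsys, sud_enable, sud_probe, sud_disable) are hypothetical,
the fallback constant values are assumed to match linux/prctl.h on kernels that
carry the feature, and the selector values follow the names used in this
version of the document.  Instead of registering an allowed region, the sketch
has the handler switch the selector off, so that its own return through
(rt_)sigreturn is not intercepted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed to match linux/prctl.h). */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
# define PR_SET_SYSCALL_USER_DISPATCH 59
# define PR_SYS_DISPATCH_OFF 0
# define PR_SYS_DISPATCH_ON 1
#endif

/* The userspace selector: one byte that the kernel reads on syscall entry. */
static volatile char selector = PR_SYS_DISPATCH_OFF;
static_assert(sizeof(selector) == 1, "selector must be char-sized");

/* Set by the handler so callers can tell that a syscall was redirected. */
static volatile sig_atomic_t sigsys_seen;

/* Hypothetical handler: a real compatibility layer would emulate the
 * trapped call here.  Turning the selector off first lets the handler's
 * return (sigreturn) go through without raising a nested SIGSYS. */
static void on_sigsys(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)info;
    (void)ucontext;
    selector = PR_SYS_DISPATCH_OFF;
    sigsys_seen = 1;
}

/* Install the SIGSYS handler, then enable dispatch for the calling thread.
 * Returns 0 on success, -1 with errno set otherwise (for example on a
 * kernel without CONFIG_SYSCALL_USER_DISPATCH). */
static int sud_enable(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return -1;

    /* offset = 0 and length = 0: this sketch registers no allowed region. */
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 0UL, 0UL, (unsigned long)&selector);
}

/* Issue one syscall with redirection switched on.  Returns 1 if it was
 * delivered as SIGSYS, 0 if it was executed directly. */
static int sud_probe(void)
{
    sigsys_seen = 0;
    selector = PR_SYS_DISPATCH_ON;   /* flip the switch, no kernel entry */
    syscall(SYS_getpid);
    selector = PR_SYS_DISPATCH_OFF;
    return sigsys_seen;
}

/* Disable the mechanism; the remaining arguments must be zero. */
static int sud_disable(void)
{
    selector = PR_SYS_DISPATCH_OFF;
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

A caller would run sud_enable() once in each thread that needs the mechanism
and then write to the selector around calls into non-native code.  On a kernel
without support, sud_enable() fails and sud_probe() simply reports that the
syscall ran directly.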