mbox series

[v1,0/4,RFC] Implement Trampoline File Descriptor

Message ID 20200728131050.24443-1-madvenka@linux.microsoft.com (mailing list archive)
Headers show
Series Implement Trampoline File Descriptor | expand

Message

Madhavan T. Venkataraman July 28, 2020, 1:10 p.m. UTC
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduction
------------

Trampolines are used in many different user applications. Trampoline
code is often generated at runtime. Trampoline code can also just be a
pre-defined sequence of machine instructions in a data buffer.

Trampoline code is placed either in a data page or in a stack page. In
order to execute a trampoline, the page it resides in needs to be mapped
with execute permissions. Writable pages with execute permissions provide
an attack surface for hackers. Attackers can use this to inject malicious
code, modify existing code or do other harm.

To mitigate this, LSMs such as SELinux may not allow pages to have both
write and execute permissions. This prevents trampolines from executing
and blocks applications that use trampolines. To allow genuine applications
to run, exceptions have to be made for them (by setting execmem, etc).
In this case, the attack surface is just the pages of such applications.

An application that is not allowed to have writable executable pages
may try to load trampoline code into a file and map the file with execute
permissions. In this case, the attack surface is just the buffer that
contains trampoline code. However, a successful exploit may provide the
hacker with means to load his own code in a file, map it and execute it.

LSMs (such as the IPE proposal [1]) may allow only properly signed object
files to be mapped with execute permissions. This will prevent trampoline
files from being mapped. Again, exceptions have to be made for genuine
applications.

We need a way to execute trampolines without making security exceptions
where possible and to reduce the attack surface even further.

Examples of trampolines
-----------------------

libffi (A Portable Foreign Function Interface Library):

libffi allows a user to define functions with an arbitrary list of
arguments and return value through a feature called "Closures".
Closures use trampolines to jump to ABI handlers that handle calling
conventions and call a target function. libffi is used by a lot
of different applications. To name a few:

	- Python
	- Java
	- Javascript
	- Ruby FFI
	- Lisp
	- Objective C

GCC nested functions:

GCC has traditionally used trampolines for implementing nested
functions. The trampoline is placed on the user stack. So, the stack
needs to be executable.

Currently available solution
----------------------------

One solution that has been proposed to allow trampolines to be executed
without making security exceptions is Trampoline Emulation. See:

https://pax.grsecurity.net/docs/emutramp.txt

In this solution, the kernel recognizes certain sequences of instructions
as "well-known" trampolines. When such a trampoline is executed, a page
fault happens because the trampoline page does not have execute permission.
The kernel recognizes the trampoline and emulates it. Basically, the
kernel does the work of the trampoline on behalf of the application.

Here, the attack surface is the buffer that contains the trampoline.
The attack surface is narrower than before. A hacker may still be able to
modify what gets loaded in the registers or modify the target PC to point
to arbitrary locations.

Currently, the emulated trampolines are the ones used in libffi and GCC
nested functions. To my knowledge, only X86 is supported at this time.

As noted in emutramp.txt, this is not a generic solution. For every new
trampoline that needs to be supported, new instruction sequences need to
be recognized by the kernel and emulated. And this has to be done for
every architecture that needs to be supported.

emutramp.txt notes the following:

"... the real solution is not in emulation but by designing a kernel API
for runtime code generation and modifying userland to make use of it."

Trampoline File Descriptor (trampfd)
--------------------------

I am proposing a kernel API using anonymous file descriptors that
can be used to create and execute trampolines with the help of the
kernel. In this solution also, the kernel does the work of the trampoline.
The API is described in patch 1/4 of this patchset. I provide a
summary here:

Trampolines commonly execute the following sequence:

	- Load some values in some registers and/or
	- Push some values on the stack
	- Jump to a target PC

libffi and GCC nested function trampolines fit into this model.

Using the kernel API, applications and libraries can:

	- Create a trampoline object
	- Associate a register context with the trampoline (including
	  a target PC)
	- Associate a stack context with the trampoline
	- Map the trampoline into a process address space
	- Execute the trampoline by executing at the trampoline address

The kernel creates the trampoline mapping without any permissions. When
the trampoline is executed by user code, a page fault happens and the
kernel gets control. The kernel recognizes that this is a trampoline
invocation. It sets up the user registers based on the specified
register context, and/or pushes values on the user stack based on the
specified stack context, and sets the user PC to the requested target
PC. When the kernel returns, execution continues at the target PC.
So, the kernel does the work of the trampoline on behalf of the
application.

In this case, the attack surface is the context buffer. A hacker may
attack an application with a vulnerability and may be able to modify the
context buffer. So, when the register or stack context is set for
a trampoline, the values may have been tampered with. From an attack
surface perspective, this is similar to Trampoline Emulation. But
with trampfd, user code can retrieve a trampoline's context from the
kernel and add defensive checks to see if the context has been
tampered with.

As for the target PC, trampfd implements a measure called the
"Allowed PCs" context (see Advantages) to prevent a hacker from making
the target PC point to arbitrary locations. So, the attack surface is
narrower than Trampoline Emulation.

Advantages of the Trampoline File Descriptor approach
-----------------------------------------------------

- trampfd is customizable. The user can specify any combination of
  allowed register name-value pairs in the register context and the kernel
  will set it up accordingly. This allows different user trampolines to be
  converted to use trampfd.

- trampfd allows a stack context to be set up so that trampolines that
  need to push values on the user stack can do that.

- The initial work is targeted for X86 and ARM. But the implementation
  leverages small portions of existing signal delivery code. Specifically,
  it uses pt_regs for setting up user registers and copy_to_user()
  to push values on the stack. So, this can be very easily ported to other
  architectures.

- trampfd provides a basic framework. In the future, new trampoline types
  can be implemented, new contexts can be defined, and additional rules
  can be implemented for security purposes.

- For instance, trampfd defines an "Allowed PCs" context in this initial
  work. As an example, libffi can create a read-only array of all ABI
  handlers for an architecture at build time. This array can be used to
  set the list of allowed PCs for a trampoline. This will mean that a hacker
  cannot hack the PC part of the register context and make it point to
  arbitrary locations.

- An SELinux setting called "exectramp" can be implemented along the
  lines of "execmem", "execstack" and "execheap" to selectively allow the
  use of trampolines on a per application basis.

- User code can add defensive checks in the code before invoking a
  trampoline to make sure that a hacker has not modified the context data.
  It can do this by getting the trampoline context from the kernel and
  double checking it.

- In the future, if the kernel can be enhanced to use a safe code
  generation component, that code can be placed in the trampoline mapping
  pages. Then, the trampoline invocation does not have to incur a trip
  into the kernel.

- Also, if the kernel can be enhanced to use a safe code generation
  component, other forms of dynamic code such as JIT code can be
  addressed by the trampfd framework.

- Trampolines can be shared across processes which can give rise to
  interesting uses in the future.

- Trampfd can be used for other purposes to extend the kernel's
  functionality.

libffi
------

I have implemented my solution for libffi and provided the changes for
X86 and ARM, 32-bit and 64-bit. Here is the reference patch:

http://linux.microsoft.com/~madvenka/libffi/libffi.txt

If the trampfd patchset gets accepted, I will send the libffi changes
to the maintainers for a review. BTW, I have also successfully executed
the libffi self tests.

Work that is pending
--------------------

- I am working on implementing an SELinux setting called "exectramp"
  similar to "execmem" to allow the use of trampfd on a per application
  basis.

- I have a comprehensive test program to test the kernel API. I am
  working on adding it to selftests.

References
----------

[1] https://microsoft.github.io/ipe/
---
Madhavan T. Venkataraman (4):
  fs/trampfd: Implement the trampoline file descriptor API
  x86/trampfd: Support for the trampoline file descriptor
  arm64/trampfd: Support for the trampoline file descriptor
  arm/trampfd: Support for the trampoline file descriptor

 arch/arm/include/uapi/asm/ptrace.h     |  20 ++
 arch/arm/kernel/Makefile               |   1 +
 arch/arm/kernel/trampfd.c              | 214 +++++++++++++++++
 arch/arm/mm/fault.c                    |  12 +-
 arch/arm/tools/syscall.tbl             |   1 +
 arch/arm64/include/asm/ptrace.h        |   9 +
 arch/arm64/include/asm/unistd.h        |   2 +-
 arch/arm64/include/asm/unistd32.h      |   2 +
 arch/arm64/include/uapi/asm/ptrace.h   |  57 +++++
 arch/arm64/kernel/Makefile             |   2 +
 arch/arm64/kernel/trampfd.c            | 278 ++++++++++++++++++++++
 arch/arm64/mm/fault.c                  |  15 +-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/uapi/asm/ptrace.h     |  38 +++
 arch/x86/kernel/Makefile               |   2 +
 arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
 arch/x86/mm/fault.c                    |  11 +
 fs/Makefile                            |   1 +
 fs/trampfd/Makefile                    |   6 +
 fs/trampfd/trampfd_data.c              |  43 ++++
 fs/trampfd/trampfd_fops.c              | 131 +++++++++++
 fs/trampfd/trampfd_map.c               |  78 ++++++
 fs/trampfd/trampfd_pcs.c               |  95 ++++++++
 fs/trampfd/trampfd_regs.c              | 137 +++++++++++
 fs/trampfd/trampfd_stack.c             | 131 +++++++++++
 fs/trampfd/trampfd_stubs.c             |  41 ++++
 fs/trampfd/trampfd_syscall.c           |  92 ++++++++
 include/linux/syscalls.h               |   3 +
 include/linux/trampfd.h                |  82 +++++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/trampfd.h           | 171 ++++++++++++++
 init/Kconfig                           |   8 +
 kernel/sys_ni.c                        |   3 +
 34 files changed, 1998 insertions(+), 7 deletions(-)
 create mode 100644 arch/arm/kernel/trampfd.c
 create mode 100644 arch/arm64/kernel/trampfd.c
 create mode 100644 arch/x86/kernel/trampfd.c
 create mode 100644 fs/trampfd/Makefile
 create mode 100644 fs/trampfd/trampfd_data.c
 create mode 100644 fs/trampfd/trampfd_fops.c
 create mode 100644 fs/trampfd/trampfd_map.c
 create mode 100644 fs/trampfd/trampfd_pcs.c
 create mode 100644 fs/trampfd/trampfd_regs.c
 create mode 100644 fs/trampfd/trampfd_stack.c
 create mode 100644 fs/trampfd/trampfd_stubs.c
 create mode 100644 fs/trampfd/trampfd_syscall.c
 create mode 100644 include/linux/trampfd.h
 create mode 100644 include/uapi/linux/trampfd.h

Comments

David Laight July 28, 2020, 3:13 p.m. UTC | #1
From:  madvenka@linux.microsoft.com
> Sent: 28 July 2020 14:11
...
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

Isn't the performance of this going to be horrid?

If you don't care that much about performance the fixup can
all be done in userspace within the fault signal handler.

Since whatever you do needs the application changed why
not change the implementation of nested functions to not
need on-stack executable trampolines.

I can think of other alternatives that don't need much more
than an array of 'push constant; jump trampoline' instructions
be created (all jump to the same place).

You might want something to create an executable page of such
instructions.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Casey Schaufler July 28, 2020, 4:05 p.m. UTC | #2
On 7/28/2020 6:10 AM, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ------------
>
> Trampolines are used in many different user applications. Trampoline
> code is often generated at runtime. Trampoline code can also just be a
> pre-defined sequence of machine instructions in a data buffer.
>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
>
> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
>
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.
>
> LSMs (such as the IPE proposal [1]) may allow only properly signed object
> files to be mapped with execute permissions. This will prevent trampoline
> files from being mapped. Again, exceptions have to be made for genuine
> applications.
>
> We need a way to execute trampolines without making security exceptions
> where possible and to reduce the attack surface even further.
>
> Examples of trampolines
> -----------------------
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
> 	- Python
> 	- Java
> 	- Javascript
> 	- Ruby FFI
> 	- Lisp
> 	- Objective C
>
> GCC nested functions:
>
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
>
> Currently available solution
> ----------------------------
>
> One solution that has been proposed to allow trampolines to be executed
> without making security exceptions is Trampoline Emulation. See:
>
> https://pax.grsecurity.net/docs/emutramp.txt
>
> In this solution, the kernel recognizes certain sequences of instructions
> as "well-known" trampolines. When such a trampoline is executed, a page
> fault happens because the trampoline page does not have execute permission.
> The kernel recognizes the trampoline and emulates it. Basically, the
> kernel does the work of the trampoline on behalf of the application.

What prevents a malicious process from using the "well-known" trampoline
to its own purposes? I expect it is obvious, but I'm not seeing it. Old
eyes, I suppose.

> Here, the attack surface is the buffer that contains the trampoline.
> The attack surface is narrower than before. A hacker may still be able to
> modify what gets loaded in the registers or modify the target PC to point
> to arbitrary locations.
>
> Currently, the emulated trampolines are the ones used in libffi and GCC
> nested functions. To my knowledge, only X86 is supported at this time.
>
> As noted in emutramp.txt, this is not a generic solution. For every new
> trampoline that needs to be supported, new instruction sequences need to
> be recognized by the kernel and emulated. And this has to be done for
> every architecture that needs to be supported.
>
> emutramp.txt notes the following:
>
> "... the real solution is not in emulation but by designing a kernel API
> for runtime code generation and modifying userland to make use of it."
>
> Trampoline File Descriptor (trampfd)
> --------------------------
>
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.
> The API is described in patch 1/4 of this patchset. I provide a
> summary here:
>
> Trampolines commonly execute the following sequence:
>
> 	- Load some values in some registers and/or
> 	- Push some values on the stack
> 	- Jump to a target PC
>
> libffi and GCC nested function trampolines fit into this model.
>
> Using the kernel API, applications and libraries can:
>
> 	- Create a trampoline object
> 	- Associate a register context with the trampoline (including
> 	  a target PC)
> 	- Associate a stack context with the trampoline
> 	- Map the trampoline into a process address space
> 	- Execute the trampoline by executing at the trampoline address
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.
>
> As for the target PC, trampfd implements a measure called the
> "Allowed PCs" context (see Advantages) to prevent a hacker from making
> the target PC point to arbitrary locations. So, the attack surface is
> narrower than Trampoline Emulation.
>
> Advantages of the Trampoline File Descriptor approach
> -----------------------------------------------------
>
> - trampfd is customizable. The user can specify any combination of
>   allowed register name-value pairs in the register context and the kernel
>   will set it up accordingly. This allows different user trampolines to be
>   converted to use trampfd.
>
> - trampfd allows a stack context to be set up so that trampolines that
>   need to push values on the user stack can do that.
>
> - The initial work is targeted for X86 and ARM. But the implementation
>   leverages small portions of existing signal delivery code. Specifically,
>   it uses pt_regs for setting up user registers and copy_to_user()
>   to push values on the stack. So, this can be very easily ported to other
>   architectures.
>
> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.
>
> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.
>
> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
>
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.
>
> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
>
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.
>
> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.
>
> - Trampfd can be used for other purposes to extend the kernel's
>   functionality.
>
> libffi
> ------
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.txt
>
> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.
>
> Work that is pending
> --------------------
>
> - I am working on implementing an SELinux setting called "exectramp"
>   similar to "execmem" to allow the use of trampfd on a per application
>   basis.

You could make a separate LSM to do these checks instead of limiting
it to SELinux. Your use case, your call, of course.

>
> - I have a comprehensive test program to test the kernel API. I am
>   working on adding it to selftests.
>
> References
> ----------
>
> [1] https://microsoft.github.io/ipe/
> ---
> Madhavan T. Venkataraman (4):
>   fs/trampfd: Implement the trampoline file descriptor API
>   x86/trampfd: Support for the trampoline file descriptor
>   arm64/trampfd: Support for the trampoline file descriptor
>   arm/trampfd: Support for the trampoline file descriptor
>
>  arch/arm/include/uapi/asm/ptrace.h     |  20 ++
>  arch/arm/kernel/Makefile               |   1 +
>  arch/arm/kernel/trampfd.c              | 214 +++++++++++++++++
>  arch/arm/mm/fault.c                    |  12 +-
>  arch/arm/tools/syscall.tbl             |   1 +
>  arch/arm64/include/asm/ptrace.h        |   9 +
>  arch/arm64/include/asm/unistd.h        |   2 +-
>  arch/arm64/include/asm/unistd32.h      |   2 +
>  arch/arm64/include/uapi/asm/ptrace.h   |  57 +++++
>  arch/arm64/kernel/Makefile             |   2 +
>  arch/arm64/kernel/trampfd.c            | 278 ++++++++++++++++++++++
>  arch/arm64/mm/fault.c                  |  15 +-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/uapi/asm/ptrace.h     |  38 +++
>  arch/x86/kernel/Makefile               |   2 +
>  arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
>  arch/x86/mm/fault.c                    |  11 +
>  fs/Makefile                            |   1 +
>  fs/trampfd/Makefile                    |   6 +
>  fs/trampfd/trampfd_data.c              |  43 ++++
>  fs/trampfd/trampfd_fops.c              | 131 +++++++++++
>  fs/trampfd/trampfd_map.c               |  78 ++++++
>  fs/trampfd/trampfd_pcs.c               |  95 ++++++++
>  fs/trampfd/trampfd_regs.c              | 137 +++++++++++
>  fs/trampfd/trampfd_stack.c             | 131 +++++++++++
>  fs/trampfd/trampfd_stubs.c             |  41 ++++
>  fs/trampfd/trampfd_syscall.c           |  92 ++++++++
>  include/linux/syscalls.h               |   3 +
>  include/linux/trampfd.h                |  82 +++++++
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  include/uapi/linux/trampfd.h           | 171 ++++++++++++++
>  init/Kconfig                           |   8 +
>  kernel/sys_ni.c                        |   3 +
>  34 files changed, 1998 insertions(+), 7 deletions(-)
>  create mode 100644 arch/arm/kernel/trampfd.c
>  create mode 100644 arch/arm64/kernel/trampfd.c
>  create mode 100644 arch/x86/kernel/trampfd.c
>  create mode 100644 fs/trampfd/Makefile
>  create mode 100644 fs/trampfd/trampfd_data.c
>  create mode 100644 fs/trampfd/trampfd_fops.c
>  create mode 100644 fs/trampfd/trampfd_map.c
>  create mode 100644 fs/trampfd/trampfd_pcs.c
>  create mode 100644 fs/trampfd/trampfd_regs.c
>  create mode 100644 fs/trampfd/trampfd_stack.c
>  create mode 100644 fs/trampfd/trampfd_stubs.c
>  create mode 100644 fs/trampfd/trampfd_syscall.c
>  create mode 100644 include/linux/trampfd.h
>  create mode 100644 include/uapi/linux/trampfd.h
>
Madhavan T. Venkataraman July 28, 2020, 4:32 p.m. UTC | #3
Thanks. See inline..

On 7/28/20 10:13 AM, David Laight wrote:
> From:  madvenka@linux.microsoft.com
>> Sent: 28 July 2020 14:11
> ...
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> Isn't the performance of this going to be horrid?

It takes about the same amount of time as getpid(). So, it is
one quick trip into the kernel. I expect that applications will
typically not care about this extra overhead as long as
they are able to run.

But I agree that if there is an application that cannot tolerate
this extra overhead, then it is an issue. See below for further
discussion.

In the libffi changes I have included in the cover letter, I have
done it in such a way that trampfd is chosen when current
security settings don't allow other methods such as
loading trampoline code into a file and mapping it. In this
case, the application can at least run with trampfd.

>
> If you don't care that much about performance the fixup can
> all be done in userspace within the fault signal handler.

I do care about performance.

This is a framework to address trampolines. In this initial
work, I want to establish one basic way for things to work.
In the future, trampfd can be enhanced for performance.
For instance, it is easy for an architecture to generate
the exact instructions required to load specified registers,
push specified values on the stack and jump to a target
PC. The kernel can map a page with the generated code
with execute permissions. In this case, the performance
issue goes away.
> Since whatever you do needs the application changed why
> not change the implementation of nested functions to not
> need on-stack executable trampolines.

I kinda agree with your suggestion.

But it is up to the GCC folks to change its implementation.
I am trying to provide a way for their existing implementation
to work in a more secure way.
> I can think of other alternatives that don't need much more
> than an array of 'push constant; jump trampoline' instructions
> be created (all jump to the same place).
>
> You might want something to create an executable page of such
> instructions.

Agreed. And that can be done within this framework as
I have mentioned above.

But it is not just this trampoline type that I have implemented
in this patchset. In the future, other types can be implemented
and other contexts can be defined. Basically, the approach is
for the user to supply a recipe to the kernel and leave it up to
the kernel to do it in the best way possible. I am hoping that
other forms of dynamic code can be addressed in the future
using the same framework.

*Purely as a hypothetical example*, a user can supply
instructions in a language such as BPF that the kernel
understands and have the kernel arrange for that to
be executed in user context.

Madhavan

> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Madhavan T. Venkataraman July 28, 2020, 4:49 p.m. UTC | #4
Thanks.

On 7/28/20 11:05 AM, Casey Schaufler wrote:
>> In this solution, the kernel recognizes certain sequences of instructions
>> as "well-known" trampolines. When such a trampoline is executed, a page
>> fault happens because the trampoline page does not have execute permission.
>> The kernel recognizes the trampoline and emulates it. Basically, the
>> kernel does the work of the trampoline on behalf of the application.
> What prevents a malicious process from using the "well-known" trampoline
> to its own purposes? I expect it is obvious, but I'm not seeing it. Old
> eyes, I suppose.

You are quite right. As I note below, the attack surface is the
buffer that contains the trampoline code. Since the kernel does
check the instruction sequence, the sequence cannot be
changed by a hacker. But the hacker can presumably change
the register values and redirect the PC to his desired location.

The assumption with trampoline emulation is that the
system will have security settings that will prevent pages from
having both write and execute permissions. So, a hacker
cannot load his own code in a page and redirect the PC to
it and execute his own code. But he can probably set the
PC to point to arbitrary locations. For instance, jump to
the middle of a C library function.
>
>> Here, the attack surface is the buffer that contains the trampoline.
>> The attack surface is narrower than before. A hacker may still be able to
>> modify what gets loaded in the registers or modify the target PC to point
>> to arbitrary locations.
...
>> Work that is pending
>> --------------------
>>
>> - I am working on implementing an SELinux setting called "exectramp"
>>   similar to "execmem" to allow the use of trampfd on a per application
>>   basis.
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.

OK. I will research this.

Madhavan
James Morris July 28, 2020, 5:05 p.m. UTC | #5
On Tue, 28 Jul 2020, Casey Schaufler wrote:

> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.

It's not limited to SELinux. This is hooked via the LSM API and 
implementable by any LSM (similar to execmem, execstack etc.)
Madhavan T. Venkataraman July 28, 2020, 5:08 p.m. UTC | #6
On 7/28/20 12:05 PM, James Morris wrote:
> On Tue, 28 Jul 2020, Casey Schaufler wrote:
>
>> You could make a separate LSM to do these checks instead of limiting
>> it to SELinux. Your use case, your call, of course.
> It's not limited to SELinux. This is hooked via the LSM API and 
> implementable by any LSM (similar to execmem, execstack etc.)

Yes. I have an implementation that I am testing right now that
defines the hook for exectramp and implements it for
SELinux. That is why I mentioned SELinux.

Madhavan
Andy Lutomirski July 28, 2020, 5:16 p.m. UTC | #7
On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
> > From:  madvenka@linux.microsoft.com
> >> Sent: 28 July 2020 14:11
> > ...
> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> > Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.

What did you test this on?  A page fault on any modern x86_64 system
is much, much, much, much slower than a syscall.

--Andy
Andy Lutomirski July 28, 2020, 5:31 p.m. UTC | #8
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>

> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

This is quite clever, but now I’m wondering just how much kernel help
is really needed. In your series, the trampoline is an non-executable
page.  I can think of at least two alternative approaches, and I'd
like to know the pros and cons.

1. Entirely userspace: a return trampoline would be something like:

1:
pushq %rax
pushq %rbc
pushq %rcx
...
pushq %r15
movq %rsp, %rdi # pointer to saved regs
leaq 1b(%rip), %rsi # pointer to the trampoline itself
callq trampoline_handler # see below

You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed.  The
'callq trampoline_handler' part would need to be a bit clever to make
it continue to work despite this remapping.  This will be *much*
faster than trampfd. How much of your use case would it cover?  For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.

2. Use existing kernel functionality.  Raise a signal, modify the
state, and return from the signal.  This is very flexible and may not
be all that much slower than trampfd.

3. Use a syscall.  Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:

pushq %rax
movq __NR_magic_trampoline, %rax
syscall

with some adjustment if the stack slot you're clobbering is important.


Also, will using trampfd cause issues with various unwinders?  I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.

All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code.  This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc.  A cleverly designed JIT interface could function without
seriailization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games.  At its very simplest, this could be:

void *jit_create_code(const void *source, size_t len);

and the result would be a new anonymous mapping that contains exactly
the code requested.  There could also be:

int jittfd_create(...);

that does something similar but creates a memfd.  A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region.  On x86, an appendable JIT region would
start filled with 0xCC, and I bet there's a way to materialize new
code into a previously 0xcc-filled virtual page wthout any
synchronization.  One approach would be to start with:

<some code>
0xcc
0xcc
...
0xcc

and to create a whole new page like:

<some code>
<some more code>
0xcc
...
0xcc

so that the only difference is that some code changed to some more
code.  Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs.  This may not work if <some more code> spans a page
boundary.  The #BP fixup would zap the TLB and retry.  Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that fetch I$ fetch
boundaries.  I'm not sure to what extent I$ snooping helps.

--Andy
Madhavan T. Venkataraman July 28, 2020, 6:52 p.m. UTC | #9
On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> Thanks. See inline..
>>
>> On 7/28/20 10:13 AM, David Laight wrote:
>>> From:  madvenka@linux.microsoft.com
>>>> Sent: 28 July 2020 14:11
>>> ...
>>>> The kernel creates the trampoline mapping without any permissions. When
>>>> the trampoline is executed by user code, a page fault happens and the
>>>> kernel gets control. The kernel recognizes that this is a trampoline
>>>> invocation. It sets up the user registers based on the specified
>>>> register context, and/or pushes values on the user stack based on the
>>>> specified stack context, and sets the user PC to the requested target
>>>> PC. When the kernel returns, execution continues at the target PC.
>>>> So, the kernel does the work of the trampoline on behalf of the
>>>> application.
>>> Isn't the performance of this going to be horrid?
>> It takes about the same amount of time as getpid(). So, it is
>> one quick trip into the kernel. I expect that applications will
>> typically not care about this extra overhead as long as
>> they are able to run.
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.

I sent a response to this. But the mail was returned to me.
I am resending.

I tested it in on a KVM guest running Ubuntu. So, when you say that a
page fault is much slower, do you mean a regular page fault that is handled
through the VM layer? Here is the relevant code in do_user_addr_fault():

        if (unlikely(access_error(hw_error_code, vma))) {
                /*                 
                 * If it is a user execute fault, it could be a trampoline
                 * invocation.
                 */
                if ((hw_error_code & tflags) == tflags &&
                     trampfd_fault(vma, regs)) {
                         up_read(&mm->mmap_sem);
                         return;
                 }
                 bad_area_access_error(regs, hw_error_code, address, vma);
                 return;
         }
         ...
         fault = handle_mm_fault(vma, address, flags);

trampfd faults are instruction faults that go through a different code path than
the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
is time consuming. Could you clarify?

Thanks.

Madhavan
Madhavan T. Venkataraman July 28, 2020, 7:01 p.m. UTC | #10
I am working on a response to this. I will send it soon.

Thanks.

Madhavan

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is an non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbc
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.
>
> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq __NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.
>
>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.
>
> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> seriailization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page wthout any
> synchronization.  One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if <some more code> spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that fetch I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.
>
> --Andy
Andy Lutomirski July 29, 2020, 5:16 a.m. UTC | #11
On Tue, Jul 28, 2020 at 10:40 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
>
>
> On 7/28/20 12:16 PM, Andy Lutomirski wrote:
>
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
>
> From:  madvenka@linux.microsoft.com
>
> Sent: 28 July 2020 14:11
>
> ...
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.
>
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.
>
>
> I tested it in on a KVM guest running Ubuntu. So, when you say
> that a page fault is much slower, do you mean a regular page
> fault that is handled through the VM layer? Here is the relevant code
> in do_user_addr_fault():

I mean that x86 CPUs have reasonably SYSCALL and SYSRET instructions
(the former is used for 64-bit system calls on Linux and the latter is
mostly used to return from system calls), but hardware page fault
delivery and IRET (used to return from page faults) are very slow.
David Laight July 29, 2020, 8:36 a.m. UTC | #12
From: Madhavan T. Venkataraman
> Sent: 28 July 2020 19:52
...
> trampfd faults are instruction faults that go through a different code path than
> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
> is time consuming. Could you clarify?

Given that the expectation is a few instructions in userspace
(eg to pick up the original arguments for a nested call)
the (probable) thousands of clocks taken by entering the
kernel (especially with page table separation) is a massive
delta.

If entering the kernel were cheap no one would have added
the DSO functions for getting the time of day.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Florian Weimer July 29, 2020, 1:29 p.m. UTC | #13
* Andy Lutomirski:

> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is an non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbc
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.

libffi does something like this for iOS, I believe.

The only thing you really need is a PC-relative indirect call, with the
target address loaded from a different page.  The trampoline handler can
do all the rest because it can identify the trampoline from the stack.
Having a closure parameter loaded into a register will speed things up,
of course.

I still hope to transition libffi to this model for most Linux targets.
It really simplifies things because you don't have to deal with cache
flushes (on both the data and code aliases for SELinux support).

But the key observation is that efficient trampolines do not need
run-time code generation at all because their code is so regular.

Thanks,
Florian
Madhavan T. Venkataraman July 29, 2020, 5:55 p.m. UTC | #14
On 7/29/20 3:36 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 28 July 2020 19:52
> ...
>> trampfd faults are instruction faults that go through a different code path than
>> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
>> is time consuming. Could you clarify?
> Given that the expectation is a few instructions in userspace
> (eg to pick up the original arguments for a nested call)
> the (probable) thousands of clocks taken by entering the
> kernel (especially with page table separation) is a massive
> delta.
>
> If entering the kernel were cheap no one would have added
> the DSO functions for getting the time of day.

I hear you. BTW, I did not say that the overhead was trivial.
I only said that in most cases, applications may not mind that
extra overhead.

However, since multiple people have raised that as an issue,
I will address it. I mentioned before that the kernel can actually
supply the code page that sets the context and jumps to
a PC and map it so the performance issue can be addressed.
I was planning to do that as a future enhancement.

If there is a consensus that I must address it immediately, I
could do that.

I will continue this discussion in my reply to Andy's email. Let
us pick it up from there.

Thanks.

Madhavan
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
David Laight July 30, 2020, 1:09 p.m. UTC | #15
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is an non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
> 
> 1. Entirely userspace: a return trampoline would be something like:
> 
> 1:
> pushq %rax
> pushq %rbc
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below

For nested calls (where the trampoline needs to pass the
original stack frame to the nested function) I think you
just need a page full of:
	mov	$0, scratch_reg; jmp trampoline_handler
	mov	$1, scratch_reg; jmp trampoline_handler
You need an unused register, on x86-64 I think both
r10 and r11 are available.
On i386 I think eax can be used.
It might even be that the first argument register is
available - if that is used to pass in the stack frame.

The trampoline_handler then uses the passed in value
to index an array of stack frame and function pointers
and jumps to the real function.
You need to hold everything in __thread data.
And maybe be able to allocate an extra page for deeply
nested code paths (eg recursive nested functions).

You might then need a driver to create you a suitable
executable page. Somehow you need to pass in the address
of the trampoline_handler and the number for the first fault.
It need to pass back the 'stride' of the array and number
of elements created.

But if you can take the cost of the page fault, then
you can interpret the existing trampoline in userspace
within the signal handler.
This is two kernel entry/exits.

Arbitrary JIT is a different problem entirely.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Madhavan T. Venkataraman July 30, 2020, 2:42 p.m. UTC | #16
For some reason my email program is not delivering to all the
recipients because of some formatting issues. I am resending.
I apologize. I will try to get this fixed.

Sorry for the delay. I just needed to think about it a little.
I will respond to your first suggestion in this email. I will
respond to the others in separate emails if that is alright
with you.

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is an non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbc
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.

Let me state my understanding of what you are suggesting. Correct me if
I get anything wrong. If you don't mind, I will also take the liberty
of generalizing and paraphrasing your suggestion.

The goal is to create two page mappings that are adjacent to each other:

- a code page that contains template code for a trampoline. Since the
  template code would tend to be small in size, pack as many of them
  as possible within a page to conserve memory. In other words, create
  an array of the template code fragments. Each element in the array
  would be used for one trampoline instance.

- a data page that contains an array of data elements. Corresponding
  to each code element in the code page, there would be a data element
  in the data page that would contain data that is specific to a
  trampoline instance.

- Code will access data using PC-relative addressing.

The management of the code pages and allocation for each trampoline
instance would all be done in user space.

Is this the general idea?

Creating a code page
----------------------------

We can do this in one of the following ways:
- Allocate a writable page at run time, write the template code into
  the page and have execute permissions on the page.

- Allocate a writable page at run time, write the template code into
  the page and remap the page with just execute permissions.

- Allocate a writable page at run time, write the template code into
  the page, write the page into a temporary file and map the file with
  execute permissions.

- Include the template code in a code page at build time itself and
  just remap the code page each time you need a code page.

Pros and Cons
-------------------

As long as the OS provides the functionality to do this and the security
subsystem in the OS allows the actions, this is totally feasible. If not,
we need something like trampfd.

As Floren mentioned, libffi does implement something like this for MACH.

In fact, in my libffi changes, I use trampfd only after all the other methods
have failed because of security settings.

But the above approach only solves the problem for this simple type of
trampoline. It does not provide a framework for addressing more complex types
or even other forms of dynamic code.

Also, each application would need to implement this solution for itself
as opposed to relying on one implementation provided by the kernel.

Trampfd-based solution
-------------------------------

I outlined an enhancement to trampfd in a response to David Laight. In this
enhancement, the kernel is the one that would set up the code page.

The kernel would call an arch-specific support function to generate the
code required to load registers, push values on the stack and jump to a PC
for a trampoline instance based on its current context. The trampoline
instance data could be baked into the code.

My initial idea was to only have one trampoline instance per page. But I
think I can implement multiple instances per page. I just have to manage
the trampfd file private data and VMA private data accordingly to map an
element in a code page to its trampoline object.

The two approaches are similar except for the detail about who sets up
and manages the trampoline pages. In both approaches, the performance problem
is addressed. But trampfd can be used even when security settings are
restrictive.

Is my solution acceptable?

A couple of things
------------------------

- In the current trampfd implementation, no physical pages are actually
  allocated. It is just a virtual mapping. From a memory footprint
  perspective, this is good. May be, we can let the user specify if
  he wants a fast trampoline that consumes memory or a slow one that doesn't?

- In the future, we may define additional types that need the kernel to do
  the job. Examples:

    - The kernel may have a trampoline type for which it is not willing
       or able to generate code

    - The kernel could emulate dynamic code for the user

     - The kernel could interpret dynamic code for the user

     - The kernel could allow the user to access some kernel functionality
        using the framework

  In such cases, there isn't any physical code page that gets mapped into
  the user address space. We need the kernel to handle the address fault
  and provide the functionality.

One question for the reviewers
----------------------------------------

Do you think that the file descriptor based approach is fine? Or, does this
need a regular system call based implementation? There are some advantages
with a regular system call:

- We don't consume file descriptors. E.g., in libffi, we have to
  keep the file descriptor open for a closure until the closure
  is freed.

- Trampoline operations can be performed based on the trampoline
  address instead of an fd.

- Sharing of objects across processes can be implemented through
  a regular ID based method rather than sending the file descriptor
  over a unix domain socket.

- Shared objects can be persistent.

- An fd based API does structure parsing in read()/write() calls
  to obtain arguments. With a regular system call, that is not
  necessary.

Please let me know your thoughts.

Madhavan
Andy Lutomirski July 30, 2020, 8:54 p.m. UTC | #17
On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Sorry for the delay. I just wanted to think about this a little.
> In this email, I will respond to your first suggestion. I will
> respond to the rest in separate emails if that is alright with
> you.
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is an non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbc
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> Let me state what I have understood about this suggestion. Correct me if
> I get anything wrong. If you don't mind, I will also take the liberty
> of generalizing and paraphrasing your suggestion.
>
> The goal is to create two page mappings that are adjacent to each other:
>
> - a code page that contains template code for a trampoline. Since the
>  template code would tend to be small in size, pack as many of them
>  as possible within a page to conserve memory. In other words, create
>  an array of the template code fragments. Each element in the array
>  would be used for one trampoline instance.
>
> - a data page that contains an array of data elements. Corresponding
>  to each code element in the code page, there would be a data element
>  in the data page that would contain data that is specific to a
>  trampoline instance.
>
> - Code will access data using PC-relative addressing.
>
> The management of the code pages and allocation for each trampoline
> instance would all be done in user space.
>
> Is this the general idea?

Yes.

>
> Creating a code page
> --------------------
>
> We can do this in one of the following ways:
>
> - Allocate a writable page at run time, write the template code into
>   the page and have execute permissions on the page.
>
> - Allocate a writable page at run time, write the template code into
>   the page and remap the page with just execute permissions.
>
> - Allocate a writable page at run time, write the template code into
>   the page, write the page into a temporary file and map the file with
>   execute permissions.
>
> - Include the template code in a code page at build time itself and
>   just remap the code page each time you need a code page.

This latter part shouldn't need any special permissions as far as I know.

>
> Pros and Cons
> -------------
>
> As long as the OS provides the functionality to do this and the security
> subsystem in the OS allows the actions, this is totally feasible. If not,
> we need something like trampfd.
>
> As Floren mentioned, libffi does implement something like this for MACH.
>
> In fact, in my libffi changes, I use trampfd only after all the other methods
> have failed because of security settings.
>
> But the above approach only solves the problem for this simple type of
> trampoline. It does not provide a framework for addressing more complex types
> or even other forms of dynamic code.
>
> Also, each application would need to implement this solution for itself
> as opposed to relying on one implementation provided by the kernel.

I would argue this is a benefit.  If the whole implementation is in
userspace, there is no ABI compatibility issue.  The user program
contains the trampoline code and the code that uses it.

>
> Trampfd-based solution
> ----------------------
>
> I outlined an enhancement to trampfd in a response to David Laight. In this
> enhancement, the kernel is the one that would set up the code page.
>
> The kernel would call an arch-specific support function to generate the
> code required to load registers, push values on the stack and jump to a PC
> for a trampoline instance based on its current context. The trampoline
> instance data could be baked into the code.
>
> My initial idea was to only have one trampoline instance per page. But I
> think I can implement multiple instances per page. I just have to manage
> the trampfd file private data and VMA private data accordingly to map an
> element in a code page to its trampoline object.
>
> The two approaches are similar except for the detail about who sets up
> and manages the trampoline pages. In both approaches, the performance problem
> is addressed. But trampfd can be used even when security settings are
> restrictive.
>
> Is my solution acceptable?

Perhaps.  In general, before adding a new ABI to the kernel, it's nice
to understand how it's better than doing the same thing in userspace.
Saying that it's easier for user code to work with if it's in the
kernel isn't necessarily an adequate justification.

Why would remapping two pages of actual application text ever fail?
Madhavan T. Venkataraman July 31, 2020, 5:13 p.m. UTC | #18
On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> ...
>> Creating a code page
>> --------------------
>>
>> We can do this in one of the following ways:
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page and have execute permissions on the page.
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page and remap the page with just execute permissions.
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page, write the page into a temporary file and map the file with
>>   execute permissions.
>>
>> - Include the template code in a code page at build time itself and
>>   just remap the code page each time you need a code page.
> This latter part shouldn't need any special permissions as far as I know.

Agreed.
>
>> Pros and Cons
>> -------------
>>
>> As long as the OS provides the functionality to do this and the security
>> subsystem in the OS allows the actions, this is totally feasible. If not,
>> we need something like trampfd.
>>
>> As Floren mentioned, libffi does implement something like this for MACH.
>>
>> In fact, in my libffi changes, I use trampfd only after all the other methods
>> have failed because of security settings.
>>
>> But the above approach only solves the problem for this simple type of
>> trampoline. It does not provide a framework for addressing more complex types
>> or even other forms of dynamic code.
>>
>> Also, each application would need to implement this solution for itself
>> as opposed to relying on one implementation provided by the kernel.
> I would argue this is a benefit.  If the whole implementation is in
> userspace, there is no ABI compatibility issue.  The user program
> contains the trampoline code and the code that uses it.

The current trampfd implementation also does not have an ABI issue.
ABI details are to be handled in user land. In the case of libffi, they
are. Trampfd only addresses the trampoline required to jump to the
ABI handler.

>
>> Trampfd-based solution
>> ----------------------
>>
>> I outlined an enhancement to trampfd in a response to David Laight. In this
>> enhancement, the kernel is the one that would set up the code page.
>>
>> The kernel would call an arch-specific support function to generate the
>> code required to load registers, push values on the stack and jump to a PC
>> for a trampoline instance based on its current context. The trampoline
>> instance data could be baked into the code.
>>
>> My initial idea was to only have one trampoline instance per page. But I
>> think I can implement multiple instances per page. I just have to manage
>> the trampfd file private data and VMA private data accordingly to map an
>> element in a code page to its trampoline object.
>>
>> The two approaches are similar except for the detail about who sets up
>> and manages the trampoline pages. In both approaches, the performance problem
>> is addressed. But trampfd can be used even when security settings are
>> restrictive.
>>
>> Is my solution acceptable?
> Perhaps.  In general, before adding a new ABI to the kernel, it's nice
> to understand how it's better than doing the same thing in userspace.
> Saying that it's easier for user code to work with if it's in the
> kernel isn't necessarily an adequate justification.

Fair enough.

Dealing with multiple architectures
-----------------------------------------------

One good reason to use trampfd is multiple architecture support. The
trampoline table in a code page approach is neat. I don't deny that at
all. But my question is - can it be used in all cases?

It requires PC-relative data references. I have not worked on all architectures.
So, I need to study this. But do all ISAs support PC-relative data references?

Even in an ISA that supports it, there would be a maximum supported offset
from the current PC that can be reached for a data reference. That maximum
needs to be at least the size of a base page in the architecture. This is because
the code page and the data page need to be separate for security reasons.
Do all ISAs support a sufficiently large offset?

When the kernel generates the code for a trampoline, it can hard code data values
in the generated code itself so it does not need PC-relative data referencing.

And, for ISAs that do support the large offset, we do have to implement and
maintain the code page stuff for different ISAs for each application and library
if we did not use trampfd.

If you look at the libffi reference patch that I have linked in the cover letter,
I have added functions in common code that wrap trampfd calls. From architecture
specific code, there is just one function call to one of those wrapper functions
to set the register context for the trampoline. This is a very small C code change
in each architecture. So, support can be extended to all architectures without
exception easily.

Runtime generated trampolines
-------------------------------------------

libffi trampolines are simple. But there may be many cases out there where the
trampoline code cannot be statically defined at build time. It may have to be
generated at runtime. For this, we will need trampfd.

Security
-----------

With the user level trampoline table approach, the data part of the trampoline table
can be hacked by an attacker if an application has a vulnerability. Specifically, the
target PC can be altered to some arbitrary location. Trampfd implements an
"Allowed PCS" context. In the libffi changes, I have created a read-only array of
all ABI handlers used in closures for each architecture. This read-only array
can be used to restrict the PC values for libffi trampolines to prevent hacking.

To generalize, we can implement security rules/features if the trampoline
object is in the kernel.

Standardization
---------------------

Trampfd is a framework that can be used to implement multiple things. May be,
a few of those things can also be implemented in user land itself. But I think having
just one mechanism to execute dynamic code objects is preferable to having
multiple mechanisms not standardized across all applications.

As an example, let us say that I am able to implement support for JIT code. Let us
say that an interpreter uses libffi to execute a generated function. The interpreter
would use trampfd for the JIT code object and get an address. Then, it would pass
that to libffi which would then use trampfd for the trampoline. So, trampfd based
code objects can be chained.

> Why would remapping two pages of actual application text ever fail?

Remapping a page may not be available on all OSes. However, that is not a problem
for the code page approach. One can always memory map the code page from the
binary file directly. So, yes, this would not fail.

Madhavan
Mark Rutland July 31, 2020, 6:09 p.m. UTC | #19
Hi,

On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.

For the purpose of below, IIUC this assumes the adversary has an
arbitrary write.

> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
> 
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.

It's not clear to me what power the adversary is assumed to have here,
and consequently it's not clear to me how the proposal mitigates this.

For example, if the attack can control the arguments to syscalls, and
has an arbitrary write as above, what prevents them from creating a
trampfd of their own?

[...]

> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.

IIUC generally nested functions are avoided these days, specifically to
prevent the creation of gadgets on the stack. So I don't think those are
relevant as a cased to care about. Applications using them should move
to not using them, and would be more secure generally for doing so.

[...]

> Trampoline File Descriptor (trampfd)
> --------------------------
> 
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.

What's the rationale for the kernel emulating the trampoline here?

In ther case of EMUTRAMP this was necessary to work with existing
application binaries and kernel ABIs which placed instructions onto the
stack, and the stack needed to remain RW for other reasons. That
restriction doesn't apply here.

Assuming trampfd creation is somehow authenticated, the code could be
placed in a r-x page (which the kernel could refuse to add write
permission), in order to prevent modification. If that's sufficient,
it's not much of a leap to allow userspace to generate the code.

> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
> 
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.

Can you elaborate on this: what sort of checks would be applied, and
how?

Why is this not possible in a r-x user page?

[...]

> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.

From a kernel developer perspective, this reads as "this ABI will become
more complex", which I think is worrisome.

I'm also worried that this is liable to have nasty interaction with HW
CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
we bake incompatibility into ABI.

> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.

I'm not exactly sure what's meant here. Do you mean that this prevents
userspace from branching into the middle of a trampoline, or that the
trampfd code prevents where the trampoline itself can branch to?

Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
former, and I believe the latter can also be implemented in userspace
with defensive checks in the trampolines, provided that they are
protected read-only.

> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
> 
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.

As above, without examples it's not clear to me what sort of chacks are
possible nor where they wouild need to be made. So it's difficult to see
whether that's actually possible or subject to TOCTTOU races and
similar.

> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
> 
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.

I don't see why it's necessary for the kernel to generate code at all.
If the trampfd creation requests can be trusted, what prevents trusting
a sealed set of instructions generated in userspace?

> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.

This sounds like the use-case of a sealed memfd. Is a sealed executable
memfd not sufficient?

Thanks,
Mark.
Mark Rutland July 31, 2020, 6:31 p.m. UTC | #20
On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> > On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> > <madvenka@linux.microsoft.com> wrote:
> Dealing with multiple architectures
> -----------------------------------------------
> 
> One good reason to use trampfd is multiple architecture support. The
> trampoline table in a code page approach is neat. I don't deny that at
> all. But my question is - can it be used in all cases?
> 
> It requires PC-relative data references. I have not worked on all architectures.
> So, I need to study this. But do all ISAs support PC-relative data references?

Not all do, but pretty much any recent ISA will as it's a practical
necessity for fast position-independent code.

> Even in an ISA that supports it, there would be a maximum supported offset
> from the current PC that can be reached for a data reference. That maximum
> needs to be at least the size of a base page in the architecture. This is because
> the code page and the data page need to be separate for security reasons.
> Do all ISAs support a sufficiently large offset?

ISAs with pc-relative addessing can usually generate PC-relative
addresses into a GPR, from which they can apply an arbitrarily large
offset.

> When the kernel generates the code for a trampoline, it can hard code data values
> in the generated code itself so it does not need PC-relative data referencing.
> 
> And, for ISAs that do support the large offset, we do have to implement and
> maintain the code page stuff for different ISAs for each application and library
> if we did not use trampfd.

Trampoline code is architecture specific today, so I don't see that as a
major issue. Common structural bits can probably be shared even if the
specifid machine code cannot.

[...]

> Security
> -----------
> 
> With the user level trampoline table approach, the data part of the trampoline table
> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> target PC can be altered to some arbitrary location. Trampfd implements an
> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
> all ABI handlers used in closures for each architecture. This read-only array
> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> 
> To generalize, we can implement security rules/features if the trampoline
> object is in the kernel.

I don't follow this argument. If it's possible to statically define that
in the kernel, it's also possible to do that in userspace without any
new kernel support.

[...]

> Trampfd is a framework that can be used to implement multiple things. May be,
> a few of those things can also be implemented in user land itself. But I think having
> just one mechanism to execute dynamic code objects is preferable to having
> multiple mechanisms not standardized across all applications.

In abstract, having a common interface sounds nice, but in practice
elements of this are always architecture-specific (e.g. interactiosn
with HW CFI), and that common interface can result in more pain as it
doesn't fit naturally into the context that ISAs were designed for (e.g. 
where control-flow instructions are extended with new semantics).

It also meass that you can't share the rough approach across OSs which
do not implement an identical mechanism, so for code abstracting by ISA
first, then by platform/ABI, there isn't much saving.

Thanks,
Mark.
Madhavan T. Venkataraman July 31, 2020, 8:08 p.m. UTC | #21
Thanks for the comments. I will respond to these and your next
email on Monday.

Madhavan

On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attack can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?
>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a cased to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In ther case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.
>
> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.
>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?
>
> Why is this not possible in a r-x user page?
>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>>   can be implemented, new contexts can be defined, and additional rules
>>   can be implemented for security purposes.
> >From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.
>
> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.
>
>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>>   work. As an example, libffi can create a read-only array of all ABI
>>   handlers for an architecture at build time. This array can be used to
>>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>>   cannot hack the PC part of the register context and make it point to
>>   arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code prevents where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.
>
>> - An SELinux setting called "exectramp" can be implemented along the
>>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>>   use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>>   trampoline to make sure that a hacker has not modified the context data.
>>   It can do this by getting the trampoline context from the kernel and
>>   double checking it.
> As above, without examples it's not clear to me what sort of chacks are
> possible nor where they wouild need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.
>
>> - In the future, if the kernel can be enhanced to use a safe code
>>   generation component, that code can be placed in the trampoline mapping
>>   pages. Then, the trampoline invocation does not have to incur a trip
>>   into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>>   component, other forms of dynamic code such as JIT code can be
>>   addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?
>
>> - Trampolines can be shared across processes which can give rise to
>>   interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?
>
> Thanks,
> Mark.
Pavel Machek Aug. 2, 2020, 11:56 a.m. UTC | #22
Hi!

> > This is quite clever, but now I???m wondering just how much kernel help
> > is really needed. In your series, the trampoline is an non-executable
> > page.  I can think of at least two alternative approaches, and I'd
> > like to know the pros and cons.
> > 
> > 1. Entirely userspace: a return trampoline would be something like:
> > 
> > 1:
> > pushq %rax
> > pushq %rbc
> > pushq %rcx
> > ...
> > pushq %r15
> > movq %rsp, %rdi # pointer to saved regs
> > leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > callq trampoline_handler # see below
> 
> For nested calls (where the trampoline needs to pass the
> original stack frame to the nested function) I think you
> just need a page full of:
> 	mov	$0, scratch_reg; jmp trampoline_handler

I believe you could do with mov %pc, scratch_reg; jmp ...

That has advantage of being able to share single physical page across multiple virtual pages...


									Pavel
Florian Weimer Aug. 2, 2020, 1:57 p.m. UTC | #23
* Madhavan T. Venkataraman:

> Standardization
> ---------------------
>
> Trampfd is a framework that can be used to implement multiple
> things. May be, a few of those things can also be implemented in
> user land itself. But I think having just one mechanism to execute
> dynamic code objects is preferable to having multiple mechanisms not
> standardized across all applications.
>
> As an example, let us say that I am able to implement support for
> JIT code. Let us say that an interpreter uses libffi to execute a
> generated function. The interpreter would use trampfd for the JIT
> code object and get an address. Then, it would pass that to libffi
> which would then use trampfd for the trampoline. So, trampfd based
> code objects can be chained.

There is certainly value in coordination.  For example, it would be
nice if unwinders could recognize the trampolines during all phases
and unwind correctly through them (including when interrupted by an
asynchronous symbol).  That requires some level of coordination with
the unwinder and dynamic linker.

A kernel solution could hide the intermediate state in a kernel-side
trap handler, but I think it wouldn't reduce the overall complexity.
Madhavan T. Venkataraman Aug. 2, 2020, 6:54 p.m. UTC | #24
More responses inline..

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.

Let me understand this. You are saying that the trampoline code
would raise a signal and, in the signal handler, set up the context
so that when the signal handler returns, we end up in the target
function with the context correctly set up. And, this trampoline code
can be generated statically at build time so that there are no
security issues using it.

Have I understood your suggestion correctly?

So, my argument would be that this would always incur the overhead
of a trip to the kernel. I think twice the overhead if I am not mistaken.
With trampfd, we can have the kernel generate the code so that there
is no performance penalty at all.

Signals have many problems. Which signal number should we use for this
purpose? If we use an existing one, that might conflict with what the application
is already handling. Getting a new signal number for this could meet
with resistance from the community.

Also, signals are asynchronous. So, they are vulnerable to race conditions.
To prevent other signals from coming in while handling the raised signal,
we would need to block and unblock signals. This will cause more
overhead.

> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq __NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.

How is this better than the kernel handling an address fault?
The system call still needs to do the same work as the fault handler.
We do need to specify the register and stack contexts before hand
so the system call can do its job.

Also, this always incurs a trip to the kernel. With trampfd, the kernel
could generate the code to avoid the performance penalty.


>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.

I need to study unwinders a little before I respond to this question.
So, bear with me.

> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> seriailization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page wthout any
> synchronization.  One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if <some more code> spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that fetch I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.

I am thinking that the trampfd API can be used for addressing JIT
code as well. I have not yet started thinking about the details. But I
think the API is sufficient. E.g.,

    struct trampfd_jit {
        void    *source;
        size_t    len;
    };

    struct trampfd_jit    jit;
    struct trampfd_map    map;
    void    *addr;

    jit.source = blah;
    jit.size = blah;

    fd = syscall(440, TRAMPFD_JIT, &jit, flags);
    pread(fd, &map, sizeof(map), TRAMPFD_MAP_OFFSET);
    addr = mmap(NULL, map.size, map.prot, map.flags, fd, map.offset);

And addr would be used to invoke the generated JIT code.

Madhavan
Andy Lutomirski Aug. 2, 2020, 8 p.m. UTC | #25
On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> More responses inline..
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality.  Raise a signal, modify the
> > state, and return from the signal.  This is very flexible and may not
> > be all that much slower than trampfd.
>
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
>
> Have I understood your suggestion correctly?

yes.

>
> So, my argument would be that this would always incur the overhead
> of a trip to the kernel. I think twice the overhead if I am not mistaken.
> With trampfd, we can have the kernel generate the code so that there
> is no performance penalty at all.

I feel like trampfd is too poorly defined at this point to evaluate.
There are three general things it could do.  It could generate actual
code that varies by instance.  It could have static code that does not
vary.  And it could actually involve a kernel entry.

If it involves a kernel entry, then it's slow.  Maybe this is okay for
some use cases.

If it involves only static code, I see no good reason that it should
be in the kernel.

If it involves dynamic code, then I think it needs a clearly defined
use case that actually requires dynamic code.

> Also, signals are asynchronous. So, they are vulnerable to race conditions.
> To prevent other signals from coming in while handling the raised signal,
> we would need to block and unblock signals. This will cause more
> overhead.

If you're worried about raise() racing against signals from out of
thread, you have bigger problems to deal with.
Madhavan T. Venkataraman Aug. 2, 2020, 10:58 p.m. UTC | #26
On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> yes.
>
>> So, my argument would be that this would always incur the overhead
>> of a trip to the kernel. I think twice the overhead if I am not mistaken.
>> With trampfd, we can have the kernel generate the code so that there
>> is no performance penalty at all.
> I feel like trampfd is too poorly defined at this point to evaluate.
> There are three general things it could do.  It could generate actual
> code that varies by instance.  It could have static code that does not
> vary.  And it could actually involve a kernel entry.
>
> If it involves a kernel entry, then it's slow.  Maybe this is okay for
> some use cases.

Yes. IMO, it is OK for most cases except where dynamic code
is used specifically for enhancing performance such as interpreters
using JIT code for frequently executed sequences and dynamic
binary translation.

> If it involves only static code, I see no good reason that it should
> be in the kernel.

It does not involve only static code. This is meant for dynamic code.
However, see below.

> If it involves dynamic code, then I think it needs a clearly defined
> use case that actually requires dynamic code.

Fair enough. I will work on this and get back to you. This might take
a little time. So, bear with me.

But I would like to make one point here. There are many applications
and libraries out there that use trampolines. They may all require the
same sort of things:

    - set register context
    - push stuff on stack
    - jump to a target PC

But in each case, the context would be different:

    - only register context
    - only stack context
    - both register and stack contexts
    - different registers
    - different values pushed on the stack
    - different target PCs

If we had to do this purely at user level, each application/library would
need to roll its own solution, the solution has to be implemented for
each supported architecture and maintained. While the code is static
in each separate case, it is dynamic across all of them.

That is, the kernel will generate the code on the fly for each trampoline
instance based on its current context. It will not maintain any static
trampoline code at all.

Basically, it will supply the context to an arch-specific function and say:

    - generate instructions for loading these regs with these values
    - generate instructions to push these values on the stack
    - generate an instruction to jump to this target PC

It will place all of those generated instructions on a page and return the address.

So, even with the static case, there is a lot of value in the kernel providing
this. Plus, it has the framework to handle dynamic code.

>> Also, signals are asynchronous. So, they are vulnerable to race conditions.
>> To prevent other signals from coming in while handling the raised signal,
>> we would need to block and unblock signals. This will cause more
>> overhead.
> If you're worried about raise() racing against signals from out of
> thread, you have bigger problems to deal with.

Agreed. The signal blocking is just one example of problems related
to signals. There are other bigger problems as well. So, let us remove
the signal-based approach from our discussions.

Thanks.

Madhavan
David Laight Aug. 3, 2020, 8:08 a.m. UTC | #27
From: Pavel Machek <pavel@ucw.cz>
> Sent: 02 August 2020 12:56
> Hi!
> 
> > > This is quite clever, but now I???m wondering just how much kernel help
> > > is really needed. In your series, the trampoline is an non-executable
> > > page.  I can think of at least two alternative approaches, and I'd
> > > like to know the pros and cons.
> > >
> > > 1. Entirely userspace: a return trampoline would be something like:
> > >
> > > 1:
> > > pushq %rax
> > > pushq %rbc
> > > pushq %rcx
> > > ...
> > > pushq %r15
> > > movq %rsp, %rdi # pointer to saved regs
> > > leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > > callq trampoline_handler # see below
> >
> > For nested calls (where the trampoline needs to pass the
> > original stack frame to the nested function) I think you
> > just need a page full of:
> > 	mov	$0, scratch_reg; jmp trampoline_handler
> 
> I believe you could do with mov %pc, scratch_reg; jmp ...
> 
> That has advantage of being able to share single physical
> page across multiple virtual pages...

A lot of architecture don't let you copy %pc that way so you would
have to use 'call' - but that trashes the return address cache.
It also needs the trampoline handler to know the addresses
of the trampolines.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Laight Aug. 3, 2020, 8:23 a.m. UTC | #28
From: Madhavan T. Venkataraman
> Sent: 02 August 2020 19:55
> To: Andy Lutomirski <luto@kernel.org>
> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>; Linux API <linux-api@vger.kernel.org>;
> linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Linux FS Devel <linux-
> fsdevel@vger.kernel.org>; linux-integrity <linux-integrity@vger.kernel.org>; LKML <linux-
> kernel@vger.kernel.org>; LSM List <linux-security-module@vger.kernel.org>; Oleg Nesterov
> <oleg@redhat.com>; X86 ML <x86@kernel.org>
> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
> 
> More responses inline..
> 
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality.  Raise a signal, modify the
> > state, and return from the signal.  This is very flexible and may not
> > be all that much slower than trampfd.
> 
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
> 
> Have I understood your suggestion correctly?

I was thinking that you'd just let the 'not executable' page fault
signal happen (SIGSEGV?) when the code jumps to on-stack trampoline
is executed.

The user signal handler can then decode the faulting instruction
and, if it matches the expected on-stack trampoline, modify the
saved registers before returning from the signal.

No kernel changes and all you need to add to the program is
an architecture-dependant signal handler.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Laight Aug. 3, 2020, 8:27 a.m. UTC | #29
From: Mark Rutland
> Sent: 31 July 2020 19:32
...
> > It requires PC-relative data references. I have not worked on all architectures.
> > So, I need to study this. But do all ISAs support PC-relative data references?
> 
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

i386 has neither PC-relative addressing nor moves from %pc.
The cpu architecture knows that the sequence:
	call	1f  
1:	pop	%reg  
is used to get the %pc value so is treated specially so that
it doesn't 'trash' the return stack.

So PIC code isn't too bad, but you have to use the correct
sequence.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Madhavan T. Venkataraman Aug. 3, 2020, 3:57 p.m. UTC | #30
On 8/3/20 3:08 AM, David Laight wrote:
> From: Pavel Machek <pavel@ucw.cz>
>> Sent: 02 August 2020 12:56
>> Hi!
>>
>>>> This is quite clever, but now I???m wondering just how much kernel help
>>>> is really needed. In your series, the trampoline is an non-executable
>>>> page.  I can think of at least two alternative approaches, and I'd
>>>> like to know the pros and cons.
>>>>
>>>> 1. Entirely userspace: a return trampoline would be something like:
>>>>
>>>> 1:
>>>> pushq %rax
>>>> pushq %rbc
>>>> pushq %rcx
>>>> ...
>>>> pushq %r15
>>>> movq %rsp, %rdi # pointer to saved regs
>>>> leaq 1b(%rip), %rsi # pointer to the trampoline itself
>>>> callq trampoline_handler # see below
>>> For nested calls (where the trampoline needs to pass the
>>> original stack frame to the nested function) I think you
>>> just need a page full of:
>>> 	mov	$0, scratch_reg; jmp trampoline_handler
>> I believe you could do with mov %pc, scratch_reg; jmp ...
>>
>> That has advantage of being able to share single physical
>> page across multiple virtual pages...
> A lot of architecture don't let you copy %pc that way so you would
> have to use 'call' - but that trashes the return address cache.
> It also needs the trampoline handler to know the addresses
> of the trampolines.

Do you which ones don't allow you to copy %pc?

Some of the architctures do not have PC-relative data references.
If they do not allow you to copy the PC into a general purpose
register, then there is no way to implement the statically defined
trampoline that has been discussed so far. In these cases, the
trampoline has to be generate at runtime.

Thanks.

Madhavan
Madhavan T. Venkataraman Aug. 3, 2020, 3:59 p.m. UTC | #31
On 8/3/20 3:23 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 02 August 2020 19:55
>> To: Andy Lutomirski <luto@kernel.org>
>> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>; Linux API <linux-api@vger.kernel.org>;
>> linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Linux FS Devel <linux-
>> fsdevel@vger.kernel.org>; linux-integrity <linux-integrity@vger.kernel.org>; LKML <linux-
>> kernel@vger.kernel.org>; LSM List <linux-security-module@vger.kernel.org>; Oleg Nesterov
>> <oleg@redhat.com>; X86 ML <x86@kernel.org>
>> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
>>
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> I was thinking that you'd just let the 'not executable' page fault
> signal happen (SIGSEGV?) when the code jumps to on-stack trampoline
> is executed.
>
> The user signal handler can then decode the faulting instruction
> and, if it matches the expected on-stack trampoline, modify the
> saved registers before returning from the signal.
>
> No kernel changes and all you need to add to the program is
> an architecture-dependant signal handler.

Understood.

Madhavan
Madhavan T. Venkataraman Aug. 3, 2020, 4:03 p.m. UTC | #32
On 8/3/20 3:27 AM, David Laight wrote:
> From: Mark Rutland
>> Sent: 31 July 2020 19:32
> ...
>>> It requires PC-relative data references. I have not worked on all architectures.
>>> So, I need to study this. But do all ISAs support PC-relative data references?
>> Not all do, but pretty much any recent ISA will as it's a practical
>> necessity for fast position-independent code.
> i386 has neither PC-relative addressing nor moves from %pc.
> The cpu architecture knows that the sequence:
> 	call	1f  
> 1:	pop	%reg  
> is used to get the %pc value so is treated specially so that
> it doesn't 'trash' the return stack.
>
> So PIC code isn't too bad, but you have to use the correct
> sequence.

Is that true only for 32-bit systems only? I thought RIP-relative addressing was
introduced in 64-bit mode. Please confirm.

Madhavan
David Laight Aug. 3, 2020, 4:57 p.m. UTC | #33
From: Madhavan T. Venkataraman
> Sent: 03 August 2020 17:03
> 
> On 8/3/20 3:27 AM, David Laight wrote:
> > From: Mark Rutland
> >> Sent: 31 July 2020 19:32
> > ...
> >>> It requires PC-relative data references. I have not worked on all architectures.
> >>> So, I need to study this. But do all ISAs support PC-relative data references?
> >> Not all do, but pretty much any recent ISA will as it's a practical
> >> necessity for fast position-independent code.
> > i386 has neither PC-relative addressing nor moves from %pc.
> > The cpu architecture knows that the sequence:
> > 	call	1f
> > 1:	pop	%reg
> > is used to get the %pc value so is treated specially so that
> > it doesn't 'trash' the return stack.
> >
> > So PIC code isn't too bad, but you have to use the correct
> > sequence.
> 
> Is that true only for 32-bit systems only? I thought RIP-relative addressing was
> introduced in 64-bit mode. Please confirm.

I said i386 not amd64 or x86-64.

So yes, 64bit code has PC-relative addressing.
But I'm pretty sure it has no other way to get the PC itself
except using call - certainly nothing in the 'usual' instructions.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Madhavan T. Venkataraman Aug. 3, 2020, 4:57 p.m. UTC | #34
Responses inline..

On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attack can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?

That is the point. If a process is allowed to have pages that are both
writable and executable, a hacker can exploit some vulnerability such
as buffer overflow to write his own code into a page and somehow
contrive to execute that.

So, the context is - if security settings in a system disallow a page to have
both write and execute permissions, how do you allow the execution of
genuine trampolines that are runtime generated and placed in a data
page or a stack page?

trampfd tries to address that. So, trampfd is not a measure that increases
the security of a system or mitigates a security problem. It is a framework
to allow safe forms of dynamic code to execute when security settings
will block them otherwise.

>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a cased to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.

Could not agree with you more.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In ther case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.

In addition to the stack, EMUTRAMP also allows the emulation
of the same well-known trampolines placed in a non-stack data page.
For instance, libffi closures embed a trampoline in a closure structure.
That gets executed when the caller of libffi invokes it.

The goal of EMUTRAMP is to allow safe trampolines to execute when
security settings disallow their execution. Mainly, it permits applications
that use libffi to run. A lot of applications use libffi.

They chose the emulation method so that no changes need to be made
to application code to use them. But the EMUTRAMP implementors note
in their description that the real solution to the problem is a kernel
API that is backed by a safe code generator.

trampd is an attempt to define such an API. This is just a starting point.
I realize that we need to have a lot of discussion to refine the approach.

> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.

IIUC, you are suggesting that the user hands the kernel a code fragment
and requests it to be placed in an r-x page, correct? However, the
kernel cannot trust any code given to it by the user. Nor can it scan any
piece of code and reliably decide if it is safe or not.

So, the problem of executing dynamic code when security settings are
restrictive cannot be solved in userland. The only option I can think of is
to have the kernel provide support for dynamic code. It must have one
or more safe, trusted code generation components and an API to use
the components.

My goal is to introduce an API and start off by supporting simple, regular
trampolines that are widely used. Then, evolve the feature over a period
of time to include other forms of dynamic code such as JIT code.

>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?

So, an application that uses trampfd would do the following steps:

1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4. Invoke the trampoline using the address

Let us say that the application has a vulnerability such as buffer overflow
that allows a hacker to modify the data that is used to do step 2.

Potentially, a hacker could modify the following things:
    - register values specified in the register context
    - values specified in the stack context
    - the target PC specified in the register context

When the trampoline is invoked in step 4, the kernel will gain control,
load the registers, push stuff on the stack and transfer control to the target
PC. Whatever the hacker had modified in step 2 will take effect in step 4.
His values will get loaded and his PC is the one that will get control.

A paranoid application could add a step to this sequence. So, the steps
would be:

1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4a. Retrieve the register and stack context for the trampoline from the
      kernel and check if anything has been altered. If yes, abort.
4b. Invoke the trampoline using the address

The check that I mentioned will be in step 4a. Now, the hacker has to
hack both step 2 and step 4a to let his stuff take effect. That is far
less likely to succeed because there needs to exist a vulnerability in
both places.

> Why is this not possible in a r-x user page?

This is answered above.
>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>>   can be implemented, new contexts can be defined, and additional rules
>>   can be implemented for security purposes.
> >From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.

I hear you. My goal from the beginning is to not have the kernel deal
with ABI issues. ABI handling is best left to userland (except in cases
like signal handlers where the kernel does have to deal with it).

In the libffi changes, this is certainly true. The kernel only helps with
the trampoline that passes control to the ABI handler. The ABI handler
itself is part of libffi.

> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.

I will study CFI and then answer this question. So, bear with me.
>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>>   work. As an example, libffi can create a read-only array of all ABI
>>   handlers for an architecture at build time. This array can be used to
>>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>>   cannot hack the PC part of the register context and make it point to
>>   arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code prevents where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.

So, I mentioned before that a hacker can potentially alter the target
PC that a trampoline finally jumps to.

If a process were allowed to have pages with both write and execute
permissions, a hacker could load his own code in one of those pages and
point the PC to that.

In the context of trampfd, we are talking about the case where a process is
not permitted to have both write and execute permissions. In this case,
the hacker cannot load his own code anywhere and hope to execute it.
But a hacker can point the PC to some arbitrary place such as return
from glibc.

>
>> - An SELinux setting called "exectramp" can be implemented along the
>>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>>   use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>>   trampoline to make sure that a hacker has not modified the context data.
>>   It can do this by getting the trampoline context from the kernel and
>>   double checking it.
> As above, without examples it's not clear to me what sort of chacks are
> possible nor where they wouild need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.

I have explained this above. If there are any further questions on that,
please let me know.

>
>> - In the future, if the kernel can be enhanced to use a safe code
>>   generation component, that code can be placed in the trampoline mapping
>>   pages. Then, the trampoline invocation does not have to incur a trip
>>   into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>>   component, other forms of dynamic code such as JIT code can be
>>   addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?

Let us consider a system in which:
    - a process is not permitted to have pages with both write and execute
    - a process is not permitted to map any file as executable unless it
      is properly signed. In other words, cryptographically verified.

Then, the process cannot execute any code that is runtime generated.
That includes trampolines. Only trampoline code that is part of program
text at build time would be permitted to execute.

In this scenario, trampfd requests are coming from signed code. So, they
are trusted by the kernel. But trampoline code could be dynamically generated.
The kernel will not trust it.

>> - Trampolines can be shared across processes which can give rise to
>>   interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?

I will answer this in a separate email.

Thanks.

Madhavan
Madhavan T. Venkataraman Aug. 3, 2020, 5 p.m. UTC | #35
On 8/3/20 11:57 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 03 August 2020 17:03
>>
>> On 8/3/20 3:27 AM, David Laight wrote:
>>> From: Mark Rutland
>>>> Sent: 31 July 2020 19:32
>>> ...
>>>>> It requires PC-relative data references. I have not worked on all architectures.
>>>>> So, I need to study this. But do all ISAs support PC-relative data references?
>>>> Not all do, but pretty much any recent ISA will as it's a practical
>>>> necessity for fast position-independent code.
>>> i386 has neither PC-relative addressing nor moves from %pc.
>>> The cpu architecture knows that the sequence:
>>> 	call	1f
>>> 1:	pop	%reg
>>> is used to get the %pc value so is treated specially so that
>>> it doesn't 'trash' the return stack.
>>>
>>> So PIC code isn't too bad, but you have to use the correct
>>> sequence.
>> Is that true only for 32-bit systems only? I thought RIP-relative addressing was
>> introduced in 64-bit mode. Please confirm.
> I said i386 not amd64 or x86-64.

I am sorry. My bad.

>
> So yes, 64bit code has PC-relative addressing.
> But I'm pretty sure it has no other way to get the PC itself
> except using call - certainly nothing in the 'usual' instructions.

OK.

Madhavan
Madhavan T. Venkataraman Aug. 3, 2020, 5:58 p.m. UTC | #36
On 7/31/20 1:31 PM, Mark Rutland wrote:
> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>> <madvenka@linux.microsoft.com> wrote:
>> Dealing with multiple architectures
>> -----------------------------------------------
>>
>> One good reason to use trampfd is multiple architecture support. The
>> trampoline table in a code page approach is neat. I don't deny that at
>> all. But my question is - can it be used in all cases?
>>
>> It requires PC-relative data references. I have not worked on all architectures.
>> So, I need to study this. But do all ISAs support PC-relative data references?
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

So, two questions:

1. IIUC, for position independent code, we need PC-relative control transfers. I know that
    PC-relative control transfers are kinda fundamental. So, I expect most architectures
    support it. But to implement the trampoline table suggestion, we need PC-relative
    data references. Like:

    movq    X(%rip), %rax

2. Do you know which architectures do not support PC-relative data references? I am
    going to study this. But if you have some information, I would appreciate it.

In any case, I think we should support all of the architectures on which Linux currently
runs even if they are legacy.

>
>> Even in an ISA that supports it, there would be a maximum supported offset
>> from the current PC that can be reached for a data reference. That maximum
>> needs to be at least the size of a base page in the architecture. This is because
>> the code page and the data page need to be separate for security reasons.
>> Do all ISAs support a sufficiently large offset?
> ISAs with pc-relative addessing can usually generate PC-relative
> addresses into a GPR, from which they can apply an arbitrarily large
> offset.

I will study this. I need to nail down the list of architectures that cannot do this.

>
>> When the kernel generates the code for a trampoline, it can hard code data values
>> in the generated code itself so it does not need PC-relative data referencing.
>>
>> And, for ISAs that do support the large offset, we do have to implement and
>> maintain the code page stuff for different ISAs for each application and library
>> if we did not use trampfd.
> Trampoline code is architecture specific today, so I don't see that as a
> major issue. Common structural bits can probably be shared even if the
> specifid machine code cannot.

True. But an implementor may prefer a standard mechanism provided by
the kernel so all of his architectures can be supported easily with less
effort.

If you look at the libffi reference patch I have included, the architecture
specific changes to use trampfd just involve a single C function call to
a common code function.

So, from the point of view of adoption, IMHO, the kernel provided method
is preferable.

>
> [...]
>
>> Security
>> -----------
>>
>> With the user level trampoline table approach, the data part of the trampoline table
>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>> target PC can be altered to some arbitrary location. Trampfd implements an
>> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
>> all ABI handlers used in closures for each architecture. This read-only array
>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>
>> To generalize, we can implement security rules/features if the trampoline
>> object is in the kernel.
> I don't follow this argument. If it's possible to statically define that
> in the kernel, it's also possible to do that in userspace without any
> new kernel support.
It is not statically defined in the kernel.

Let us take the libffi example. In the 64-bit X86 arch code, there are 3
ABI handlers:

    ffi_closure_unix64_sse
    ffi_closure_unix64
    ffi_closure_win64

I could create an "Allowed PCs" context like this:

struct my_allowed_pcs {
    struct trampfd_values    pcs;
    __u64                             pc_values[3];
};

const struct my_allowed_pcs    my_allowed_pcs = {
    { 3, 0 },
    (uintptr_t) ffi_closure_unix64_sse,
    (uintptr_t) ffi_closure_unix64,
    (uintptr_t) ffi_closure_win64,
};

I have created a read-only array of allowed ABI handlers that closures use.

When I set up the context for a closure trampoline, I could do this:

    pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
   
This copies the array into the trampoline object in the kernel.
When the register context is set for the trampoline, the kernel checks
the PC register value against allowed PCs.

Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
permitted target PCs enforced by the kernel are the ABI handlers.
>
> [...]
>
>> Trampfd is a framework that can be used to implement multiple things. May be,
>> a few of those things can also be implemented in user land itself. But I think having
>> just one mechanism to execute dynamic code objects is preferable to having
>> multiple mechanisms not standardized across all applications.
> In abstract, having a common interface sounds nice, but in practice
> elements of this are always architecture-specific (e.g. interactiosn
> with HW CFI), and that common interface can result in more pain as it
> doesn't fit naturally into the context that ISAs were designed for (e.g. 
> where control-flow instructions are extended with new semantics).

In the case of trampfd, the code generation is indeed architecture
specific. But that is in the kernel. The application is not affected by it.

Again, referring to the libffi reference patch, I have defined wrapper
functions for trampfd in common code. The architecture specific code
in libffi only calls the set_context function defined in common code.
Even this is required only because register names are specific to each
architecture and the target PC (to the ABI handler) is specific to
each architecture-ABI combo.

> It also meass that you can't share the rough approach across OSs which
> do not implement an identical mechanism, so for code abstracting by ISA
> first, then by platform/ABI, there isn't much saving.

Why can you not share the same approach across OSes? In fact,
I have tried to design it so that other OSes can use the same
mechanism.

The only thing is that I have defined the API to be based on a file
descriptor since that is what is generally preferred by the Linux community
for a new API. If I were to implement it as a regular system call, the same
system call can be implemented in other OSes as well.

Thanks.

Madhavan
Madhavan T. Venkataraman Aug. 3, 2020, 6:36 p.m. UTC | #37
On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> I feel like trampfd is too poorly defined at this point to evaluate.

Point taken. It is because I wanted to start with something small
and specific and expand it in the future. So, I did not really describe the big
picture - the overall vision, future work, that sort of thing. In retrospect,
may be, I should have done that.

I will take all of the input I have received so far and all of the responses
I have given, refine the definition of trampfd and send it out. Please
review that and let me know if anything is still missing from the
definition.

Thanks.

Madhavan
Mark Rutland Aug. 4, 2020, 1:55 p.m. UTC | #38
On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
> On 7/31/20 1:31 PM, Mark Rutland wrote:
> > On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> >> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> >>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> >>> <madvenka@linux.microsoft.com> wrote:
>> >> When the kernel generates the code for a trampoline, it can hard code data values
> >> in the generated code itself so it does not need PC-relative data referencing.
> >>
> >> And, for ISAs that do support the large offset, we do have to implement and
> >> maintain the code page stuff for different ISAs for each application and library
> >> if we did not use trampfd.
> > Trampoline code is architecture specific today, so I don't see that as a
> > major issue. Common structural bits can probably be shared even if the
> > specifid machine code cannot.
> 
> True. But an implementor may prefer a standard mechanism provided by
> the kernel so all of his architectures can be supported easily with less
> effort.
> 
> If you look at the libffi reference patch I have included, the architecture
> specific changes to use trampfd just involve a single C function call to
> a common code function.

Sure but in addition to that each architecture backend had to define a
set of arguments to that. I view the C function is analagous to the
"common structural bits".

I appreciate that your patch is small today (and architectures seem to
largely align on what they need), but I don't think it's necessarily
true that things will remain so simple as architecture are extended and
their calling conventions evolve, and I also don't think it's clear that
this will work for more complex cases elsewhere.

[...]

> >> With the user level trampoline table approach, the data part of the trampoline table
> >> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> >> target PC can be altered to some arbitrary location. Trampfd implements an
> >> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
> >> all ABI handlers used in closures for each architecture. This read-only array
> >> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> >>
> >> To generalize, we can implement security rules/features if the trampoline
> >> object is in the kernel.
> > I don't follow this argument. If it's possible to statically define that
> > in the kernel, it's also possible to do that in userspace without any
> > new kernel support.
> It is not statically defined in the kernel.
> 
> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
> ABI handlers:
> 
>     ffi_closure_unix64_sse
>     ffi_closure_unix64
>     ffi_closure_win64
> 
> I could create an "Allowed PCs" context like this:
> 
> struct my_allowed_pcs {
>     struct trampfd_values    pcs;
>     __u64                             pc_values[3];
> };
> 
> const struct my_allowed_pcs    my_allowed_pcs = {
>     { 3, 0 },
>     (uintptr_t) ffi_closure_unix64_sse,
>     (uintptr_t) ffi_closure_unix64,
>     (uintptr_t) ffi_closure_win64,
> };
> 
> I have created a read-only array of allowed ABI handlers that closures use.
> 
> When I set up the context for a closure trampoline, I could do this:
> 
>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>    
> This copies the array into the trampoline object in the kernel.
> When the register context is set for the trampoline, the kernel checks
> the PC register value against allowed PCs.
> 
> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
> permitted target PCs enforced by the kernel are the ABI handlers.

Sorry, when I said "statically define" meant when you knew legitimate
targets ahead of time when you create the trampoline (i.e. whether you
could enumerate those and know they would not change dynamically).

My point was that you can achieve the same in userspace if the
trampoline and array of legitimate targets are in read-only memory,
without having to trap to the kernel.

I think the key point here is that an adversary must be prevented from
altering a trampoline and any associated metadata, and I think that
there are ways of achieving that without having to trap into the kernel,
and without the kernel having to be intimately aware of the calling
conventions used in userspace.

[...]

> >> Trampfd is a framework that can be used to implement multiple things. May be,
> >> a few of those things can also be implemented in user land itself. But I think having
> >> just one mechanism to execute dynamic code objects is preferable to having
> >> multiple mechanisms not standardized across all applications.
> > In abstract, having a common interface sounds nice, but in practice
> > elements of this are always architecture-specific (e.g. interactiosn
> > with HW CFI), and that common interface can result in more pain as it
> > doesn't fit naturally into the context that ISAs were designed for (e.g. 
> > where control-flow instructions are extended with new semantics).
> 
> In the case of trampfd, the code generation is indeed architecture
> specific. But that is in the kernel. The application is not affected by it.

As an ABI detail, applications are *definitely* affected by this, and it
is wrong to suggest they are not even if you don't have a specific case
in mind today. As this forms a contract between userspace and the kernel
it's overly simplistic to say that it's the kernel's problem

For example, in the case of BTI on arm64, what should the trampoline
set PSTATE.BTYPE to? Different use-cases *will* want different values,
and not necessarily the value of PSTATE at the instant the call to the
trampoline was made. In the case of libffi specifically using the
original value of PSTATE.BTYPE probably is sound, but other code
sequences may need to restrict/broaden or entirely change that.

> Again, referring to the libffi reference patch, I have defined wrapper
> functions for trampfd in common code. The architecture specific code
> in libffi only calls the set_context function defined in common code.
> Even this is required only because register names are specific to each
> architecture and the target PC (to the ABI handler) is specific to
> each architecture-ABI combo.
> 
> > It also meass that you can't share the rough approach across OSs which
> > do not implement an identical mechanism, so for code abstracting by ISA
> > first, then by platform/ABI, there isn't much saving.
> 
> Why can you not share the same approach across OSes? In fact,
> I have tried to design it so that other OSes can use the same
> mechanism.

Sure, but where they *don't*, you must fall back to the existing
purely-userspace mechanisms, and so a codebase now has the burden of
maintaining two distinct mechanisms.

Whereas if there's a way of doing this in userspace with (stronger)
enforcement of memory permissions the trampoline code can be common for
when this is present or absent, which is much easier for a codebase rto
maintain, and could make use of weaker existing mechanisms to improve
the situation on systems without the new functionality.

Thanks,
Mark.
Mark Rutland Aug. 4, 2020, 2:30 p.m. UTC | #39
On Mon, Aug 03, 2020 at 11:57:57AM -0500, Madhavan T. Venkataraman wrote:
> Responses inline..
> 
> On 7/31/20 1:09 PM, Mark Rutland wrote:
> > Hi,
> >
> > On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >> Trampoline code is placed either in a data page or in a stack page. In
> >> order to execute a trampoline, the page it resides in needs to be mapped
> >> with execute permissions. Writable pages with execute permissions provide
> >> an attack surface for hackers. Attackers can use this to inject malicious
> >> code, modify existing code or do other harm.
> > For the purpose of below, IIUC this assumes the adversary has an
> > arbitrary write.
> >
> >> To mitigate this, LSMs such as SELinux may not allow pages to have both
> >> write and execute permissions. This prevents trampolines from executing
> >> and blocks applications that use trampolines. To allow genuine applications
> >> to run, exceptions have to be made for them (by setting execmem, etc).
> >> In this case, the attack surface is just the pages of such applications.
> >>
> >> An application that is not allowed to have writable executable pages
> >> may try to load trampoline code into a file and map the file with execute
> >> permissions. In this case, the attack surface is just the buffer that
> >> contains trampoline code. However, a successful exploit may provide the
> >> hacker with means to load his own code in a file, map it and execute it.
> > It's not clear to me what power the adversary is assumed to have here,
> > and consequently it's not clear to me how the proposal mitigates this.
> >
> > For example, if the attack can control the arguments to syscalls, and
> > has an arbitrary write as above, what prevents them from creating a
> > trampfd of their own?
> 
> That is the point. If a process is allowed to have pages that are both
> writable and executable, a hacker can exploit some vulnerability such
> as buffer overflow to write his own code into a page and somehow
> contrive to execute that.

I understood that, and that was not my question.

> So, the context is - if security settings in a system disallow a page to have
> both write and execute permissions, how do you allow the execution of
> genuine trampolines that are runtime generated and placed in a data
> page or a stack page?

There are options today, e.g.

a) If the restriction is only per-alias, you can have distinct aliases
   where one is writable and another is executable, and you can make it
   hard to find the relationship between the two.

b) If the restriction is only temporal, you can write instructions into
   an RW- buffer, transition the buffer to R--, verify the buffer
   contents, then transition it to --X.

c) You can have two processes A and B where A generates instrucitons into
   a buffer that (only) B can execute (where B may be restricted from
   making syscalls like write, mprotect, etc).

If (as this series appears to) you assume that an adversary can't
control the arguments trampfd_create() and any such call is legitimate,
then something like (b) is not weaker and can be much more general
without many of the potential ABI or performance problems of trying to
fiddle with precedure call details in the kernel.

If that's not an assumption, then I'm missing how you expect to
determine that a trampfd_create() call is legitimate, and why that could
not be applied to other calls.

[...]

> Could not agree with you more.
> >
> > [...]
> >
> >> Trampoline File Descriptor (trampfd)
> >> --------------------------
> >>
> >> I am proposing a kernel API using anonymous file descriptors that
> >> can be used to create and execute trampolines with the help of the
> >> kernel. In this solution also, the kernel does the work of the trampoline.
> > What's the rationale for the kernel emulating the trampoline here?
> >
> > In ther case of EMUTRAMP this was necessary to work with existing
> > application binaries and kernel ABIs which placed instructions onto the
> > stack, and the stack needed to remain RW for other reasons. That
> > restriction doesn't apply here.
> 
> In addition to the stack, EMUTRAMP also allows the emulation
> of the same well-known trampolines placed in a non-stack data page.
> For instance, libffi closures embed a trampoline in a closure structure.
> That gets executed when the caller of libffi invokes it.
> 
> The goal of EMUTRAMP is to allow safe trampolines to execute when
> security settings disallow their execution. Mainly, it permits applications
> that use libffi to run. A lot of applications use libffi.
> 
> They chose the emulation method so that no changes need to be made
> to application code to use them. But the EMUTRAMP implementors note
> in their description that the real solution to the problem is a kernel
> API that is backed by a safe code generator.
> 
> trampd is an attempt to define such an API. This is just a starting point.
> I realize that we need to have a lot of discussion to refine the approach.
> 
> > Assuming trampfd creation is somehow authenticated, the code could be
> > placed in a r-x page (which the kernel could refuse to add write
> > permission), in order to prevent modification. If that's sufficient,
> > it's not much of a leap to allow userspace to generate the code.
> 
> IIUC, you are suggesting that the user hands the kernel a code fragment
> and requests it to be placed in an r-x page, correct? However, the
> kernel cannot trust any code given to it by the user. Nor can it scan any
> piece of code and reliably decide if it is safe or not.

Per that same logic the kernel cannot trust trampfd creation calls to be
legitimate as the adversary could mess with the arguments. It doesn't
matter if the kernel's codegen is trustworthy if it's potentially driven
by an adversary.

> So, the problem of executing dynamic code when security settings are
> restrictive cannot be solved in userland. The only option I can think of is
> to have the kernel provide support for dynamic code. It must have one
> or more safe, trusted code generation components and an API to use
> the components.
> 
> My goal is to introduce an API and start off by supporting simple, regular
> trampolines that are widely used. Then, evolve the feature over a period
> of time to include other forms of dynamic code such as JIT code.

I think that you're making a leap to this approach without sufficient
justification that it actually solves the problem, and I believe that
there will be ABI issues with this approach which can be sidestepped by
other potential approaches.

Taking a step back, I think it's necessary to better describe the
problem and constraints that you believe apply before attempting to
justify any potential solution.

[...]

> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> >>
> >> In this case, the attack surface is the context buffer. A hacker may
> >> attack an application with a vulnerability and may be able to modify the
> >> context buffer. So, when the register or stack context is set for
> >> a trampoline, the values may have been tampered with. From an attack
> >> surface perspective, this is similar to Trampoline Emulation. But
> >> with trampfd, user code can retrieve a trampoline's context from the
> >> kernel and add defensive checks to see if the context has been
> >> tampered with.
> > Can you elaborate on this: what sort of checks would be applied, and
> > how?
> 
> So, an application that uses trampfd would do the following steps:
> 
> 1. Create a trampoline by calling trampfd_create()
> 2. Set the register and/or stack contexts for the trampoline.
> 3. mmap() the trampoline to get an address
> 4. Invoke the trampoline using the address
> 
> Let us say that the application has a vulnerability such as buffer overflow
> that allows a hacker to modify the data that is used to do step 2.
> 
> Potentially, a hacker could modify the following things:
>     - register values specified in the register context
>     - values specified in the stack context
>     - the target PC specified in the register context
> 
> When the trampoline is invoked in step 4, the kernel will gain control,
> load the registers, push stuff on the stack and transfer control to the target
> PC. Whatever the hacker had modified in step 2 will take effect in step 4.
> His values will get loaded and his PC is the one that will get control.
> 
> A paranoid application could add a step to this sequence. So, the steps
> would be:
> 
> 1. Create a trampoline by calling trampfd_create()
> 2. Set the register and/or stack contexts for the trampoline.
> 3. mmap() the trampoline to get an address
> 4a. Retrieve the register and stack context for the trampoline from the
>       kernel and check if anything has been altered. If yes, abort.
> 4b. Invoke the trampoline using the address

As above, you can also do this when using mprotect today, transitioning
the buffer RWX -> R-- -> R-X. If you're worried about subsequent
modification via an alias, a sealed memfd would work assuming that can
be mapped R-X.

This approach is applicable to trampfd, but it isn't a specific benefit
of trampfd.

[...] 

> >> - In the future, if the kernel can be enhanced to use a safe code
> >>   generation component, that code can be placed in the trampoline mapping
> >>   pages. Then, the trampoline invocation does not have to incur a trip
> >>   into the kernel.
> >>
> >> - Also, if the kernel can be enhanced to use a safe code generation
> >>   component, other forms of dynamic code such as JIT code can be
> >>   addressed by the trampfd framework.
> > I don't see why it's necessary for the kernel to generate code at all.
> > If the trampfd creation requests can be trusted, what prevents trusting
> > a sealed set of instructions generated in userspace?
> 
> Let us consider a system in which:
>     - a process is not permitted to have pages with both write and execute
>     - a process is not permitted to map any file as executable unless it
>       is properly signed. In other words, cryptographically verified.
> 
> Then, the process cannot execute any code that is runtime generated.
> That includes trampolines. Only trampoline code that is part of program
> text at build time would be permitted to execute.
> 
> In this scenario, trampfd requests are coming from signed code. So, they
> are trusted by the kernel. But trampoline code could be dynamically generated.
> The kernel will not trust it.

I think this a very hand-wavy argument, as it suggests that generated
code is not trusted, but what is effectively a generated bytecode is.

If certain codegen can be trusted, then we can add mechanisms to permit
the results of this to be mapped r-x. If that is not possible, then the
same argument says that trampfd requests cannot be trusted.

Thanks,
Mark.
David Laight Aug. 4, 2020, 2:33 p.m. UTC | #40
> > If you look at the libffi reference patch I have included, the architecture
> > specific changes to use trampfd just involve a single C function call to
> > a common code function.

No idea what libffi is, but it must surely be simpler to
rewrite it to avoid nested function definitions.

Or find a book from the 1960s on how to do recursive
calls and nested functions in FORTRAN-IV.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Laight Aug. 4, 2020, 2:44 p.m. UTC | #41
> > > If you look at the libffi reference patch I have included, the architecture
> > > specific changes to use trampfd just involve a single C function call to
> > > a common code function.
> 
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.
> 
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.

FWIW it is probably as simple as:
1) Put all the 'variables' the nested function accesses into a struct.
2) Add a field for the address of the 'nested' function.
3) Pass the address of the structure down instead of the
   address of the function.

If you aren't in control of the call sites then add the
structure to a linked list on a thread-local variable.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Madhavan T. Venkataraman Aug. 4, 2020, 2:48 p.m. UTC | #42
On 8/4/20 9:33 AM, David Laight wrote:
>>> If you look at the libffi reference patch I have included, the architecture
>>> specific changes to use trampfd just involve a single C function call to
>>> a common code function.
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.

Sorry if I wasn't clear.

libffi is a separate use case and GCC nested functions is a separate one.
libffi is not used to solve the nested function stuff.

For nested functions, GCC generates trampoline code and arranges to
place it on the stack and execute it.

I agree with your other points about nested function implementation.

Madhavan
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
Madhavan T. Venkataraman Aug. 4, 2020, 3:46 p.m. UTC | #43
Hey Mark,

I am working on putting together an improved definition of trampfd per
Andy's comment. I will try to address your comments in that improved
definition. Once I send that out, I will respond to your emails as well.

Thanks.

Madhavan

On 8/4/20 8:55 AM, Mark Rutland wrote:
> On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/31/20 1:31 PM, Mark Rutland wrote:
>>> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>>>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>>>> <madvenka@linux.microsoft.com> wrote:
>>>>> When the kernel generates the code for a trampoline, it can hard code data values
>>>> in the generated code itself so it does not need PC-relative data referencing.
>>>>
>>>> And, for ISAs that do support the large offset, we do have to implement and
>>>> maintain the code page stuff for different ISAs for each application and library
>>>> if we did not use trampfd.
>>> Trampoline code is architecture specific today, so I don't see that as a
>>> major issue. Common structural bits can probably be shared even if the
>>> specifid machine code cannot.
>> True. But an implementor may prefer a standard mechanism provided by
>> the kernel so all of his architectures can be supported easily with less
>> effort.
>>
>> If you look at the libffi reference patch I have included, the architecture
>> specific changes to use trampfd just involve a single C function call to
>> a common code function.
> Sure but in addition to that each architecture backend had to define a
> set of arguments to that. I view the C function is analagous to the
> "common structural bits".
>
> I appreciate that your patch is small today (and architectures seem to
> largely align on what they need), but I don't think it's necessarily
> true that things will remain so simple as architecture are extended and
> their calling conventions evolve, and I also don't think it's clear that
> this will work for more complex cases elsewhere.
>
> [...]
>
>>>> With the user level trampoline table approach, the data part of the trampoline table
>>>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>>>> target PC can be altered to some arbitrary location. Trampfd implements an
>>>> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
>>>> all ABI handlers used in closures for each architecture. This read-only array
>>>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>>>
>>>> To generalize, we can implement security rules/features if the trampoline
>>>> object is in the kernel.
>>> I don't follow this argument. If it's possible to statically define that
>>> in the kernel, it's also possible to do that in userspace without any
>>> new kernel support.
>> It is not statically defined in the kernel.
>>
>> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
>> ABI handlers:
>>
>>     ffi_closure_unix64_sse
>>     ffi_closure_unix64
>>     ffi_closure_win64
>>
>> I could create an "Allowed PCs" context like this:
>>
>> struct my_allowed_pcs {
>>     struct trampfd_values    pcs;
>>     __u64                             pc_values[3];
>> };
>>
>> const struct my_allowed_pcs    my_allowed_pcs = {
>>     { 3, 0 },
>>     (uintptr_t) ffi_closure_unix64_sse,
>>     (uintptr_t) ffi_closure_unix64,
>>     (uintptr_t) ffi_closure_win64,
>> };
>>
>> I have created a read-only array of allowed ABI handlers that closures use.
>>
>> When I set up the context for a closure trampoline, I could do this:
>>
>>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>>    
>> This copies the array into the trampoline object in the kernel.
>> When the register context is set for the trampoline, the kernel checks
>> the PC register value against allowed PCs.
>>
>> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
>> permitted target PCs enforced by the kernel are the ABI handlers.
> Sorry, when I said "statically define" meant when you knew legitimate
> targets ahead of time when you create the trampoline (i.e. whether you
> could enumerate those and know they would not change dynamically).
>
> My point was that you can achieve the same in userspace if the
> trampoline and array of legitimate targets are in read-only memory,
> without having to trap to the kernel.
>
> I think the key point here is that an adversary must be prevented from
> altering a trampoline and any associated metadata, and I think that
> there are ways of achieving that without having to trap into the kernel,
> and without the kernel having to be intimately aware of the calling
> conventions used in userspace.
>
> [...]
>
>>>> Trampfd is a framework that can be used to implement multiple things. May be,
>>>> a few of those things can also be implemented in user land itself. But I think having
>>>> just one mechanism to execute dynamic code objects is preferable to having
>>>> multiple mechanisms not standardized across all applications.
>>> In abstract, having a common interface sounds nice, but in practice
>>> elements of this are always architecture-specific (e.g. interactiosn
>>> with HW CFI), and that common interface can result in more pain as it
>>> doesn't fit naturally into the context that ISAs were designed for (e.g. 
>>> where control-flow instructions are extended with new semantics).
>> In the case of trampfd, the code generation is indeed architecture
>> specific. But that is in the kernel. The application is not affected by it.
> As an ABI detail, applications are *definitely* affected by this, and it
> is wrong to suggest they are not even if you don't have a specific case
> in mind today. As this forms a contract between userspace and the kernel
> it's overly simplistic to say that it's the kernel's problem
>
> For example, in the case of BTI on arm64, what should the trampoline
> set PSTATE.BTYPE to? Different use-cases *will* want different values,
> and not necessarily the value of PSTATE at the instant the call to the
> trampoline was made. In the case of libffi specifically using the
> original value of PSTATE.BTYPE probably is sound, but other code
> sequences may need to restrict/broaden or entirely change that.
>
>> Again, referring to the libffi reference patch, I have defined wrapper
>> functions for trampfd in common code. The architecture specific code
>> in libffi only calls the set_context function defined in common code.
>> Even this is required only because register names are specific to each
>> architecture and the target PC (to the ABI handler) is specific to
>> each architecture-ABI combo.
>>
>>> It also meass that you can't share the rough approach across OSs which
>>> do not implement an identical mechanism, so for code abstracting by ISA
>>> first, then by platform/ABI, there isn't much saving.
>> Why can you not share the same approach across OSes? In fact,
>> I have tried to design it so that other OSes can use the same
>> mechanism.
> Sure, but where they *don't*, you must fall back to the existing
> purely-userspace mechanisms, and so a codebase now has the burden of
> maintaining two distinct mechanisms.
>
> Whereas if there's a way of doing this in userspace with (stronger)
> enforcement of memory permissions the trampoline code can be common for
> when this is present or absent, which is much easier for a codebase rto
> maintain, and could make use of weaker existing mechanisms to improve
> the situation on systems without the new functionality.
>
> Thanks,
> Mark.
Madhavan T. Venkataraman Aug. 6, 2020, 5:26 p.m. UTC | #44
Thanks for the lively discussion. I have tried to answer some of the
comments below.

On 8/4/20 9:30 AM, Mark Rutland wrote:
>
>> So, the context is - if security settings in a system disallow a page to have
>> both write and execute permissions, how do you allow the execution of
>> genuine trampolines that are runtime generated and placed in a data
>> page or a stack page?
> There are options today, e.g.
>
> a) If the restriction is only per-alias, you can have distinct aliases
>    where one is writable and another is executable, and you can make it
>    hard to find the relationship between the two.
>
> b) If the restriction is only temporal, you can write instructions into
>    an RW- buffer, transition the buffer to R--, verify the buffer
>    contents, then transition it to --X.
>
> c) You can have two processes A and B where A generates instrucitons into
>    a buffer that (only) B can execute (where B may be restricted from
>    making syscalls like write, mprotect, etc).

The general principle of the mitigation is W^X. I would argue that
the above options are violations of the W^X principle. If they are
allowed today, they must be fixed. And they will be. So, we cannot
rely on them.

a) This requires a remap operation. Two mappings point to the same
     physical page. One mapping has W and the other one has X. This
     is a violation of W^X.

b) This is again a violation. The kernel should refuse to give execute
     permission to a page that was writeable in the past and refuse to
     give write permission to a page that was executable in the past.

c) This is just a variation of (a).

In general, the problem with user-level methods to map and execute
dynamic code is that the kernel cannot tell if a genuine application is
using them or an attacker is using them or piggy-backing on them.
If a security subsystem blocks all user-level methods for this reason,
we need a kernel mechanism to deal with the problem.

The kernel mechanism is not to be a backdoor. It is there to define
ways in which safe dynamic code can be executed.

I admit I have to provide more proof that my API and framework can
cover different cases. So, that is what I am doing now. I am in the process
of identifying other examples (per Andy's comment) and attempting to
show that this API and framework can address them. It will take a little time.


>>
>> IIUC, you are suggesting that the user hands the kernel a code fragment
>> and requests it to be placed in an r-x page, correct? However, the
>> kernel cannot trust any code given to it by the user. Nor can it scan any
>> piece of code and reliably decide if it is safe or not.
> Per that same logic the kernel cannot trust trampfd creation calls to be
> legitimate as the adversary could mess with the arguments. It doesn't
> matter if the kernel's codegen is trustworthy if it's potentially driven
> by an adversary.

That is not true. IMO, this is not a deficiency in trampfd. This is
something that is there even for regular system calls. For instance,
the write() system call will faithfully write out a buffer to a file
even if the buffer contents have been hacked by an attacker.
A system call can perform certain checks on incoming arguments.
But it cannot tell if a hacker has modified them.

So, there are two aspects in dynamic code that I am considering -
data and code. I submit that the data part can be hacked if an
application has a vulnerability such as buffer overflow. I don't see
how we can ever help that.

So, I am focused on the code generation part. Not all dynamic code
is the same. They have different degrees of trust.

Off the top of my head, I have tried to identify some examples
where we can have more trust on dynamic code and have the kernel
permit its execution.

1. If the kernel can do the job, then that is one safe way. Here, the kernel
    is the code. There is no code generation involved. This is what I
    have presented in the patch series as the first cut.

2. If the kernel can generate the code, then that code has a measure
    of trust. For trampolines, I agreed to do this for performance.

3. If the code resides in a signed file, then we know that it comes from
    an known source and it was generated at build time. So, it is not
    hacker generated. So, there is a measure of trust.

    This is not just program text. This could also be a buffer that contains
    trampoline code that resides in the read-only data section of a binary.

4. If the code resides in a signed file and is emulated (e.g. by QEMU)
    and we generate code for dynamic binary translation, we should
    be able to do that provided the code generator itself is not suspect.
    See the next point.   

5. The above are examples of actual machine code or equivalent.
    We could also have source code from which we generate machine
    code. E.g., JIT code from Java byte code. In this case, if the source
   code is in a signed file, we have a measure of trust on the source.
   If the kernel uses its own trusted code generator to generate the
   object code from the source code, then that object code has a
   measure of trust.

Anyway, these are just examples. The principle is - if we can identify
dynamic code that has a certain measure of trust, can the kernel
permit their execution?

All other code that cannot really be trusted by the kernel cannot be
executed safely (unless we find some safe and efficient way to
sandbox such code and limit the effects of the code to within
the sandbox). This is outside the scope of what I am doing.

>> So, the problem of executing dynamic code when security settings are
>> restrictive cannot be solved in userland. The only option I can think of is
>> to have the kernel provide support for dynamic code. It must have one
>> or more safe, trusted code generation components and an API to use
>> the components.
>>
>> My goal is to introduce an API and start off by supporting simple, regular
>> trampolines that are widely used. Then, evolve the feature over a period
>> of time to include other forms of dynamic code such as JIT code.
> I think that you're making a leap to this approach without sufficient
> justification that it actually solves the problem, and I believe that
> there will be ABI issues with this approach which can be sidestepped by
> other potential approaches.
>
> Taking a step back, I think it's necessary to better describe the
> problem and constraints that you believe apply before attempting to
> justify any potential solution.

I totally agree that more justification is needed and I am working on it.

As I have mentioned above, I intend to have the kernel generate code
only if the code generation is simple enough. For more complicated cases,
I plan to use a user-level code generator that is for exclusive kernel use.
I have yet to work out the details on how this would work. Need time.

>
> [...]
>
>>
>> 1. Create a trampoline by calling trampfd_create()
>> 2. Set the register and/or stack contexts for the trampoline.
>> 3. mmap() the trampoline to get an address
>> 4a. Retrieve the register and stack context for the trampoline from the
>>       kernel and check if anything has been altered. If yes, abort.
>> 4b. Invoke the trampoline using the address
> As above, you can also do this when using mprotect today, transitioning
> the buffer RWX -> R-- -> R-X. If you're worried about subsequent
> modification via an alias, a sealed memfd would work assuming that can
> be mapped R-X.

This is a violation of W^X and the security subsystem must be fixed
if it permits it.

> This approach is applicable to trampfd, but it isn't a specific benefit
> of trampfd.
>
> [...] 
>
>>>> - In the future, if the kernel can be enhanced to use a safe code
>>>>   generation component, that code can be placed in the trampoline mapping
>>>>   pages. Then, the trampoline invocation does not have to incur a trip
>>>>   into the kernel.
>>>>
>>>> - Also, if the kernel can be enhanced to use a safe code generation
>>>>   component, other forms of dynamic code such as JIT code can be
>>>>   addressed by the trampfd framework.
>>> I don't see why it's necessary for the kernel to generate code at all.
>>> If the trampfd creation requests can be trusted, what prevents trusting
>>> a sealed set of instructions generated in userspace?
>> Let us consider a system in which:
>>     - a process is not permitted to have pages with both write and execute
>>     - a process is not permitted to map any file as executable unless it
>>       is properly signed. In other words, cryptographically verified.
>>
>> Then, the process cannot execute any code that is runtime generated.
>> That includes trampolines. Only trampoline code that is part of program
>> text at build time would be permitted to execute.
>>
>> In this scenario, trampfd requests are coming from signed code. So, they
>> are trusted by the kernel. But trampoline code could be dynamically generated.
>> The kernel will not trust it.
> I think this a very hand-wavy argument, as it suggests that generated
> code is not trusted, but what is effectively a generated bytecode is.
> If certain codegen can be trusted, then we can add mechanisms to permit
> the results of this to be mapped r-x. If that is not possible, then the
> same argument says that trampfd requests cannot be trusted.

There is certainly an extra measure of trust in code that is in
signature verified files as compared to code that is generated
on the fly. At least, we know that the place from which we get
that code is known and the file was generated at build time
and not hacker generated. Such files could still contain a vulnerability.
But because these files are maintained by a known source, chances
are that there is nothing malicious in them.

Thanks.

Madhavan
Pavel Machek Aug. 8, 2020, 10:17 p.m. UTC | #45
Hi!

> Thanks for the lively discussion. I have tried to answer some of the
> comments below.

> > There are options today, e.g.
> >
> > a) If the restriction is only per-alias, you can have distinct aliases
> >    where one is writable and another is executable, and you can make it
> >    hard to find the relationship between the two.
> >
> > b) If the restriction is only temporal, you can write instructions into
> >    an RW- buffer, transition the buffer to R--, verify the buffer
> >    contents, then transition it to --X.
> >
> > c) You can have two processes A and B where A generates instrucitons into
> >    a buffer that (only) B can execute (where B may be restricted from
> >    making syscalls like write, mprotect, etc).
> 
> The general principle of the mitigation is W^X. I would argue that
> the above options are violations of the W^X principle. If they are
> allowed today, they must be fixed. And they will be. So, we cannot
> rely on them.

Would you mind describing your threat model?

Because I believe you are using model different from everyone else.

In particular, I don't believe b) is a problem or should be fixed.

I'll add d), application mmaps a file(R--), and uses write syscall to change
trampolines in it.

> b) This is again a violation. The kernel should refuse to give execute
> ???????? permission to a page that was writeable in the past and refuse to
> ???????? give write permission to a page that was executable in the past.

Why?

										Pavel
Madhavan T. Venkataraman Aug. 10, 2020, 5:34 p.m. UTC | #46
Resending because of mailer problems. Some of the recipients did not receive
my email. I apologize. Sigh.

Here is a redefinition of trampfd based on review comments.

I wanted to address dynamic code in 3 different ways:

    Remove the need for dynamic code where possible
    --------------------------------------------------------------------

    If the kernel itself can perform the work of some dynamic code, then
    the code can be replaced by the kernel.

    This is what I implemented in the patchset. But reviewers objected
    to the performance impact. One trip to the kernel was needed for each
    trampoline invocation. So, I have decided to defer this approach.

    Convert dynamic code to static code where possible
    ----------------------------------------------------------------------

    This is possible with help from the kernel. This has no performance
    impact and can be used in libffi, GCC nested functions, etc. I have
    described the approach below.

    Deal with code generation
    -----------------------------------

    For cases like generating JIT code from Java byte code, I wanted to
    establish a framework. However, reviewers felt that details are missing.

    Should the kernel generate code or should it use a user-level code generator?
    How do you make sure that a user level code generator can be trusted?
    How would the communication work? ABI details? Architecture support?
    Support for different types - JIT, DBT, etc?

    I have come to the conclusion that this is best done separately.

My main interest is to provide a way to convert dynamic code such as
trampolines to static code without any special architecture support.
This can be done with the kernel's help. Any code that gets written in
the future can conform to this as well.

So, in version 2 of the Trampfd RFC, I would like to simplify trampfd and
just address item 2. I will reimplement the support in libffi and present it.

Convert dynamic code to static code
------------------------------------------------

One problem with dynamic code is that it cannot be verified or authenticated
by the kernel. The kernel cannot tell the difference between genuine dynamic
code and an attacker's code. Where possible, dynamic code should be converted
to static code and placed in the text segment of a binary file. This allows
the kernel to verify the code by verifying the signature of the file.

The other problem is using user-level methods to load and execute dynamic code
can potentially be exploited by an attacker to inject his code and have it be
executed. To prevent this, a system may enforce W^X. If W^X is enforced
properly, genuine dynamic code will not be able to run. This is another
reason to convert dynamic code to static code.

The issue in converting dynamic code to static code is that the data is
dynamic. The code does not know before hand where the data is going to be
at runtime.

Some architectures support PC-relative data references. So, if you co-locate
code and data, then the code can find the data at runtime. But this is not
supported on all architectures. When supported, there may be limitations to
deal with. Plus you have to take the trouble to co-locate code and data.
And, to deal with W^X, code and data need to be in different pages.

All architectures must be supported without any limitations. Fortunately,
the kernel can solve this problem quite easily. I suggest the following:

Convert dynamic code to static code like this:

    - Decide which register should point to the data that the code needs.
      Call it register R.

    - Write the static code assuming that R already points to the data.

    - Use trampfd and pass the following to the kernel:

        - pointers to the code and data
        - the name of the register R

The kernel will write the following instructions in a trampoline page
mapped into the caller's address space with R-X.

    - Load the data address in register R
    - Jump to the static code

Basically, the kernel provides a trampoline to jump to the user's code
and returns the kernel-provided trampoline's address to the user.

It is trivial to implement a trampoline table in the trampoline page to
conserve memory.

Issues raised previously
-------------------------------

I believe that the following issues that were raised by reviewers is not
a problem in this scheme. Please rereview.

    - Florian mentioned the libffi trampoline table. Trampoline tables can be
      implemented in this scheme easily.

    - Florian mentioned stack unwinders. I am not an expert on unwinders.
      But I don't see an issue with unwinders.

    - Mark Rutland mentioned Intel's CET and CFI. Don't see a problem there.

    - Mark Rutland mentioned PAC+BTI on ARM64. Don't see a problem there.

If I have missed addressing any previously raised issue, I apologize.
Please let me know.

Thanks!

Madhavan
Madhavan T. Venkataraman Aug. 11, 2020, 12:41 p.m. UTC | #47
On 8/8/20 5:17 PM, Pavel Machek wrote:
> Hi!
> 
>> Thanks for the lively discussion. I have tried to answer some of the
>> comments below.
> 
>>> There are options today, e.g.
>>>
>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>    where one is writable and another is executable, and you can make it
>>>    hard to find the relationship between the two.
>>>
>>> b) If the restriction is only temporal, you can write instructions into
>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>    contents, then transition it to --X.
>>>
>>> c) You can have two processes A and B where A generates instrucitons into
>>>    a buffer that (only) B can execute (where B may be restricted from
>>>    making syscalls like write, mprotect, etc).
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Would you mind describing your threat model?
> 
> Because I believe you are using model different from everyone else.
> 
> In particular, I don't believe b) is a problem or should be fixed.

It is a problem because a kernel that implements W^X properly
will not allow it. It has no idea what has been done in userland.
It has no idea that the user has checked and verified the buffer
contents after transitioning the page to R--.


> 
> I'll add d), application mmaps a file(R--), and uses write syscall to change
> trampolines in it.
> 

No matter how you do it, these are all user-level methods that can be
hacked. The kernel cannot be sure that an attacker's code has
not found its way into the file.

>> b) This is again a violation. The kernel should refuse to give execute
>> ???????? permission to a page that was writeable in the past and refuse to
>> ???????? give write permission to a page that was executable in the past.
> 
> Why?

I don't know about the latter part. I guess I need to think about it.
But the former is valid. When a page is RW-, a hacker could hack the
page. Then it does not matter that the page is transitioned to R--.
Again, the kernel cannot be sure that the user has verified the contents
after R--.

IMO, W^X needs to be enforced temporally as well.

Madhavan
Pavel Machek Aug. 11, 2020, 1:08 p.m. UTC | #48
Hi!

> >> Thanks for the lively discussion. I have tried to answer some of the
> >> comments below.
> > 
> >>> There are options today, e.g.
> >>>
> >>> a) If the restriction is only per-alias, you can have distinct aliases
> >>>    where one is writable and another is executable, and you can make it
> >>>    hard to find the relationship between the two.
> >>>
> >>> b) If the restriction is only temporal, you can write instructions into
> >>>    an RW- buffer, transition the buffer to R--, verify the buffer
> >>>    contents, then transition it to --X.
> >>>
> >>> c) You can have two processes A and B where A generates instrucitons into
> >>>    a buffer that (only) B can execute (where B may be restricted from
> >>>    making syscalls like write, mprotect, etc).
> >>
> >> The general principle of the mitigation is W^X. I would argue that
> >> the above options are violations of the W^X principle. If they are
> >> allowed today, they must be fixed. And they will be. So, we cannot
> >> rely on them.
> > 
> > Would you mind describing your threat model?
> > 
> > Because I believe you are using model different from everyone else.
> > 
> > In particular, I don't believe b) is a problem or should be fixed.
> 
> It is a problem because a kernel that implements W^X properly
> will not allow it. It has no idea what has been done in userland.
> It has no idea that the user has checked and verified the buffer
> contents after transitioning the page to R--.

No, it is not a problem. W^X is designed to protect from attackers
doing buffer overflows, not attackers doing arbitrary syscalls.

Best regards,
									Pavel
Madhavan T. Venkataraman Aug. 11, 2020, 3:54 p.m. UTC | #49
On 8/11/20 8:08 AM, Pavel Machek wrote:
> Hi!
> 
>>>> Thanks for the lively discussion. I have tried to answer some of the
>>>> comments below.
>>>
>>>>> There are options today, e.g.
>>>>>
>>>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>>>    where one is writable and another is executable, and you can make it
>>>>>    hard to find the relationship between the two.
>>>>>
>>>>> b) If the restriction is only temporal, you can write instructions into
>>>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>>>    contents, then transition it to --X.
>>>>>
>>>>> c) You can have two processes A and B where A generates instrucitons into
>>>>>    a buffer that (only) B can execute (where B may be restricted from
>>>>>    making syscalls like write, mprotect, etc).
>>>>
>>>> The general principle of the mitigation is W^X. I would argue that
>>>> the above options are violations of the W^X principle. If they are
>>>> allowed today, they must be fixed. And they will be. So, we cannot
>>>> rely on them.
>>>
>>> Would you mind describing your threat model?
>>>
>>> Because I believe you are using model different from everyone else.
>>>
>>> In particular, I don't believe b) is a problem or should be fixed.
>>
>> It is a problem because a kernel that implements W^X properly
>> will not allow it. It has no idea what has been done in userland.
>> It has no idea that the user has checked and verified the buffer
>> contents after transitioning the page to R--.
> 
> No, it is not a problem. W^X is designed to protect from attackers
> doing buffer overflows, not attackers doing arbitrary syscalls.
> 

Hey Pavel,

You are correct. The W^X implementation today still has some holes.
IIUC, the principle of W^X is - user should not be able to (W) write code
into a page and use some trick to get it to (X) execute. So, what I
was trying to say was that the W^X principle is not implemented
completely today.

Mark Rutland mentioned some other tricks as well which are being used
today.

For instance, Microsoft has submitted this proposal:

 https://microsoft.github.io/ipe/

IPE is an LSM. In this proposal, only mappings that are backed by a
signature verified file can have execute permissions. This means that
all anonymous page based tricks will fail. And, file mapping based
tricks will fail as well when temporary files are used to load code
and mmap(). That is the intent.

Thanks!

Madhavan
Madhavan T. Venkataraman Aug. 11, 2020, 9:12 p.m. UTC | #50
I am working on version 2 of trampfd. Will send it out soon.

Thanks for all the comments so far!

Madhavan

On 8/10/20 12:34 PM, Madhavan T. Venkataraman wrote:
> Resending because of mailer problems. Some of the recipients did not receive
> my email. I apologize. Sigh.
> 
> Here is a redefinition of trampfd based on review comments.
> 
> I wanted to address dynamic code in 3 different ways:
> 
>     Remove the need for dynamic code where possible
>     --------------------------------------------------------------------
> 
>     If the kernel itself can perform the work of some dynamic code, then
>     the code can be replaced by the kernel.
> 
>     This is what I implemented in the patchset. But reviewers objected
>     to the performance impact. One trip to the kernel was needed for each
>     trampoline invocation. So, I have decided to defer this approach.
> 
>     Convert dynamic code to static code where possible
>     ----------------------------------------------------------------------
> 
>     This is possible with help from the kernel. This has no performance
>     impact and can be used in libffi, GCC nested functions, etc. I have
>     described the approach below.
> 
>     Deal with code generation
>     -----------------------------------
> 
>     For cases like generating JIT code from Java byte code, I wanted to
>     establish a framework. However, reviewers felt that details are missing.
> 
>     Should the kernel generate code or should it use a user-level code generator?
>     How do you make sure that a user level code generator can be trusted?
>     How would the communication work? ABI details? Architecture support?
>     Support for different types - JIT, DBT, etc?
> 
>     I have come to the conclusion that this is best done separately.
> 
> My main interest is to provide a way to convert dynamic code such as
> trampolines to static code without any special architecture support.
> This can be done with the kernel's help. Any code that gets written in
> the future can conform to this as well.
> 
> So, in version 2 of the Trampfd RFC, I would like to simplify trampfd and
> just address item 2. I will reimplement the support in libffi and present it.
> 
> Convert dynamic code to static code
> ------------------------------------------------
> 
> One problem with dynamic code is that it cannot be verified or authenticated
> by the kernel. The kernel cannot tell the difference between genuine dynamic
> code and an attacker's code. Where possible, dynamic code should be converted
> to static code and placed in the text segment of a binary file. This allows
> the kernel to verify the code by verifying the signature of the file.
> 
> The other problem is using user-level methods to load and execute dynamic code
> can potentially be exploited by an attacker to inject his code and have it be
> executed. To prevent this, a system may enforce W^X. If W^X is enforced
> properly, genuine dynamic code will not be able to run. This is another
> reason to convert dynamic code to static code.
> 
> The issue in converting dynamic code to static code is that the data is
> dynamic. The code does not know before hand where the data is going to be
> at runtime.
> 
> Some architectures support PC-relative data references. So, if you co-locate
> code and data, then the code can find the data at runtime. But this is not
> supported on all architectures. When supported, there may be limitations to
> deal with. Plus you have to take the trouble to co-locate code and data.
> And, to deal with W^X, code and data need to be in different pages.
> 
> All architectures must be supported without any limitations. Fortunately,
> the kernel can solve this problem quite easily. I suggest the following:
> 
> Convert dynamic code to static code like this:
> 
>     - Decide which register should point to the data that the code needs.
>       Call it register R.
> 
>     - Write the static code assuming that R already points to the data.
> 
>     - Use trampfd and pass the following to the kernel:
> 
>         - pointers to the code and data
>         - the name of the register R
> 
> The kernel will write the following instructions in a trampoline page
> mapped into the caller's address space with R-X.
> 
>     - Load the data address in register R
>     - Jump to the static code
> 
> Basically, the kernel provides a trampoline to jump to the user's code
> and returns the kernel-provided trampoline's address to the user.
> 
> It is trivial to implement a trampoline table in the trampoline page to
> conserve memory.
> 
> Issues raised previously
> -------------------------------
> 
> I believe that the following issues that were raised by reviewers is not
> a problem in this scheme. Please rereview.
> 
>     - Florian mentioned the libffi trampoline table. Trampoline tables can be
>       implemented in this scheme easily.
> 
>     - Florian mentioned stack unwinders. I am not an expert on unwinders.
>       But I don't see an issue with unwinders.
> 
>     - Mark Rutland mentioned Intel's CET and CFI. Don't see a problem there.
> 
>     - Mark Rutland mentioned PAC+BTI on ARM64. Don't see a problem there.
> 
> If I have missed addressing any previously raised issue, I apologize.
> Please let me know.
> 
> Thanks!
> 
> Madhavan
> 
>
Mark Rutland Aug. 12, 2020, 10:06 a.m. UTC | #51
On Thu, Aug 06, 2020 at 12:26:02PM -0500, Madhavan T. Venkataraman wrote:
> Thanks for the lively discussion. I have tried to answer some of the
> comments below.
> 
> On 8/4/20 9:30 AM, Mark Rutland wrote:
> >
> >> So, the context is - if security settings in a system disallow a page to have
> >> both write and execute permissions, how do you allow the execution of
> >> genuine trampolines that are runtime generated and placed in a data
> >> page or a stack page?
> > There are options today, e.g.
> >
> > a) If the restriction is only per-alias, you can have distinct aliases
> >    where one is writable and another is executable, and you can make it
> >    hard to find the relationship between the two.
> >
> > b) If the restriction is only temporal, you can write instructions into
> >    an RW- buffer, transition the buffer to R--, verify the buffer
> >    contents, then transition it to --X.
> >
> > c) You can have two processes A and B where A generates instrucitons into
> >    a buffer that (only) B can execute (where B may be restricted from
> >    making syscalls like write, mprotect, etc).
> 
> The general principle of the mitigation is W^X. I would argue that
> the above options are violations of the W^X principle. If they are
> allowed today, they must be fixed. And they will be. So, we cannot
> rely on them.

Hold on.

Contemporary W^X means that a given virtual alias cannot be writeable
and executeable simultaneously, permitting (a) and (b). If you read the
references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
release notes and related presentation make this clear, and further they
expect (b) to occur with JITS flipping W/X with mprotect().

Please don't conflate your assumed stronger semantics with the general
principle. It not matching you expectations does not necessarily mean
that it is wrong.

If you want a stronger W^X semantics, please refer to this specifically
with a distinct name.

> a) This requires a remap operation. Two mappings point to the same
>      physical page. One mapping has W and the other one has X. This
>      is a violation of W^X.
> 
> b) This is again a violation. The kernel should refuse to give execute
>      permission to a page that was writeable in the past and refuse to
>      give write permission to a page that was executable in the past.
> 
> c) This is just a variation of (a).

As above, this is not true.

If you have a rationale for why this is desirable or necessary, please
justify that before using this as justification for additional features.

> In general, the problem with user-level methods to map and execute
> dynamic code is that the kernel cannot tell if a genuine application is
> using them or an attacker is using them or piggy-backing on them.

Yes, and as I pointed out the same is true for trampfd unless you can
somehow authenticate the calls are legitimate (in both callsite and the
set of arguments), and I don't see any reasonable way of doing that.

If you relax your threat model to an attacker not being able to make
arbitrary syscalls, then your suggestion that userspace can perorm
chceks between syscalls may be sufficient, but as I pointed out that's
equally true for a sealed memfd or similar.

> Off the top of my head, I have tried to identify some examples
> where we can have more trust on dynamic code and have the kernel
> permit its execution.
> 
> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>     is the code. There is no code generation involved. This is what I
>     have presented in the patch series as the first cut.

This is sleight-of-hand; it doesn't matter where the logic is performed
if the power is identical. Practically speaking this is equivalent to
some dynamic code generation.

I think that it's misleading to say that because the kernel emulates
something it is safe when the provenance of the syscall arguments cannot
be verified.

[...]

> Anyway, these are just examples. The principle is - if we can identify
> dynamic code that has a certain measure of trust, can the kernel
> permit their execution?

My point generally is that the kernel cannot identify this, and if
usrspace code is trusted to dynamically generate trampfd arguments it
can equally be trusted to dyncamilly generate code.

[...]

> As I have mentioned above, I intend to have the kernel generate code
> only if the code generation is simple enough. For more complicated cases,
> I plan to use a user-level code generator that is for exclusive kernel use.
> I have yet to work out the details on how this would work. Need time.

This reads to me like trampfd is only dealing with a few special cases
and we know that we need a more general solution.

I hope I am mistaken, but I get the strong impression that you're trying
to justify your existing solution rather than trying to understand the
problem space.

To be clear, my strong opinion is that we should not be trying to do
this sort of emulation or code generation within the kernel. I do think
it's worthwhile to look at mechanisms to make it harder to subvert
dynamic userspace code generation, but I think the code generation
itself needs to live in userspace (e.g. for ABI reasons I previously
mentioned).

Mark.
Madhavan T. Venkataraman Aug. 12, 2020, 6:47 p.m. UTC | #52
On 8/12/20 5:06 AM, Mark Rutland wrote:
> [..]
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Hold on.
> 
> Contemporary W^X means that a given virtual alias cannot be writeable
> and executeable simultaneously, permitting (a) and (b). If you read the
> references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> release notes and related presentation make this clear, and further they
> expect (b) to occur with JITS flipping W/X with mprotect().
> 
> Please don't conflate your assumed stronger semantics with the general
> principle. It not matching you expectations does not necessarily mean
> that it is wrong.
> 
> If you want a stronger W^X semantics, please refer to this specifically
> with a distinct name.

OK. Fair enough. We can give a different name to the stronger requirement.
Just for the sake of this discussion and for the want of a better name,
let us call it WX2.


> 
>> a) This requires a remap operation. Two mappings point to the same
>>      physical page. One mapping has W and the other one has X. This
>>      is a violation of W^X.
>>
>> b) This is again a violation. The kernel should refuse to give execute
>>      permission to a page that was writeable in the past and refuse to
>>      give write permission to a page that was executable in the past.
>>
>> c) This is just a variation of (a).
> 
> As above, this is not true.
> 
> If you have a rationale for why this is desirable or necessary, please
> justify that before using this as justification for additional features.
> 

I already supplied the justification. Any user level method can potentially
be hijacked by an attacker for his purpose.

WX does not prevent all of the methods. We need WX2.


>> In general, the problem with user-level methods to map and execute
>> dynamic code is that the kernel cannot tell if a genuine application is
>> using them or an attacker is using them or piggy-backing on them.
> 
> Yes, and as I pointed out the same is true for trampfd unless you can
> somehow authenticate the calls are legitimate (in both callsite and the
> set of arguments), and I don't see any reasonable way of doing that.
> 

I am afraid I am not in agreement with this. If WX2 is not implemented,
an attacker can hack both code and data. If WX2 is implemented, an attacker
can only attack data. The attack surface is reduced.

Also, trampfd calls coming from code from a signed file can be authenticated.
trampfd calls coming from an attacker's generated code cannot be authenticated.


> If you relax your threat model to an attacker not being able to make
> arbitrary syscalls, then your suggestion that userspace can perorm
> chceks between syscalls may be sufficient, but as I pointed out that's
> equally true for a sealed memfd or similar.
> 

Actually, I did not suggest that userspace can perform checks. I said that
the kernel can perform checks.

User space cannot reliably perform checks between calls. A clever hacker
can cover his tracks.

In any case, the kernel has no knowledge of these checks. So, when execute
permissions are requested for a page, a properly implemented WX2 can refuse.


>> Off the top of my head, I have tried to identify some examples
>> where we can have more trust on dynamic code and have the kernel
>> permit its execution.
>>
>> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>>     is the code. There is no code generation involved. This is what I
>>     have presented in the patch series as the first cut.
> 
> This is sleight-of-hand; it doesn't matter where the logic is performed
> if the power is identical. Practically speaking this is equivalent to
> some dynamic code generation.
> 
> I think that it's misleading to say that because the kernel emulates
> something it is safe when the provenance of the syscall arguments cannot
> be verified.


I submit that there are two aspects - code and data. In one case, both
code and data can be hacked. So, an attacker can modify both code
and data. In the other case, the attacker can only modify data.
The power is not identical. The attack surface is not the same.

Most of the times, security measures are mitigations. They are not a 100%.
This approach of not allowing the user to do certain things that can be
exploited and having the kernel doing them increases our confidence.
From that perspective, the two approaches are different and it is worth
pursuing a kernel based mitigation.


> 
> [...]
> 
>> Anyway, these are just examples. The principle is - if we can identify
>> dynamic code that has a certain measure of trust, can the kernel
>> permit their execution?
> 
> My point generally is that the kernel cannot identify this, and if
> usrspace code is trusted to dynamically generate trampfd arguments it
> can equally be trusted to dyncamilly generate code.

I am afraid not. See my previous response. Ability to hack only data
gives an attacker fewer options as compared to the ability to hack
both code and data.

> 
> [...]
> 
>> As I have mentioned above, I intend to have the kernel generate code
>> only if the code generation is simple enough. For more complicated cases,
>> I plan to use a user-level code generator that is for exclusive kernel use.
>> I have yet to work out the details on how this would work. Need time.
> 
> This reads to me like trampfd is only dealing with a few special cases
> and we know that we need a more general solution.
> 
> I hope I am mistaken, but I get the strong impression that you're trying
> to justify your existing solution rather than trying to understand the
> problem space.
> 

I do understand the problem space. I wanted to address dynamic code in 3
different ways in separate phases starting from the easiest and working
my way up to the more difficult ones.

1. Remove dynamic code where possible

   If the kernel can replace user level dynamic code, then do it.
   This is what I did in version 1.

2. Replace dynamic code with static code

   Where you cannot do (1), replace dynamic code with static code with
   the kernel's help. I wanted to do this later. But I have decided to
   do this in version 2. This combined with signature verification of
   files adds a measure or trust in the code.

3. Deal with JIT, DBT, etc

   In (1) and (2), we deal with machine code. In (3), there is some source
   from which dynamic code needs to be generated using a code generator.
   E.g., JIT code from Java byte code. Here, the solution I had in mind
   had two parts:

       - Make the source more trustworthy by requiring it to be part
         of a signed file
       - Design a code generator trusted and used exclusively by the kernel

In this patchset, I wanted to lay a foundation for all 3 and attempt to
solve (1) first. Once this was in place, I wanted to do (2) and then (3).

In retrospect, I should have probably started with the big picture first
instead of starting with just item (1). But I always had the big picture
in mind. That said, I did not necessarily have all the details fleshed
out for all the phases. (3) is complex.

My focus was to define the API in a generic enough fashion so that all
3 phases can be implemented. But I realize that it is a hard sell at this
point to convince people that the API is adequate for phase 3. So,
I have decided to do (1) and (2). (3) has to be done separately with
more thought and details put into it.

Also, it may be the case that there are some examples of dynamic code
out there than can never be addressed. My goal is to try to address a
majority of the dynamic code out there.


> To be clear, my strong opinion is that we should not be trying to do
> this sort of emulation or code generation within the kernel. I do think
> it's worthwhile to look at mechanisms to make it harder to subvert
> dynamic userspace code generation, but I think the code generation
> itself needs to live in userspace (e.g. for ABI reasons I previously
> mentioned).
> 

I completely agree that the kernel should not deal with the complexities
of code generation and ABI details. My version 1 did not have any code
generation. But since a performance issue was raised, I explored the idea
of kernel code generation. To be honest, I was not really that
comfortable with the idea.

That is why I have decided to implement the second piece I had in
my plan now. This piece does not have the code generation complexities
or ABI issues. This piece can be used to solve libffi, GCC, etc.
I will still write the code in such a way that I can use the first
approach in the future if I really need it. But it will not involve any
code generation from the kernel. It will only be used for cases that
don't mind the extra trip to the kernel.

Madhavan
Mickaël Salaün Aug. 19, 2020, 6:53 p.m. UTC | #53
On 12/08/2020 12:06, Mark Rutland wrote:
> On Thu, Aug 06, 2020 at 12:26:02PM -0500, Madhavan T. Venkataraman wrote:
>> Thanks for the lively discussion. I have tried to answer some of the
>> comments below.
>>
>> On 8/4/20 9:30 AM, Mark Rutland wrote:
>>>
>>>> So, the context is - if security settings in a system disallow a page to have
>>>> both write and execute permissions, how do you allow the execution of
>>>> genuine trampolines that are runtime generated and placed in a data
>>>> page or a stack page?
>>> There are options today, e.g.
>>>
>>> a) If the restriction is only per-alias, you can have distinct aliases
>>>    where one is writable and another is executable, and you can make it
>>>    hard to find the relationship between the two.
>>>
>>> b) If the restriction is only temporal, you can write instructions into
>>>    an RW- buffer, transition the buffer to R--, verify the buffer
>>>    contents, then transition it to --X.
>>>
>>> c) You can have two processes A and B where A generates instrucitons into
>>>    a buffer that (only) B can execute (where B may be restricted from
>>>    making syscalls like write, mprotect, etc).
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Hold on.
> 
> Contemporary W^X means that a given virtual alias cannot be writeable
> and executeable simultaneously, permitting (a) and (b). If you read the
> references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> release notes and related presentation make this clear, and further they
> expect (b) to occur with JITS flipping W/X with mprotect().

W^X (with "permanent" mprotect restrictions [1]) goes back to 2000 with
PaX [2] (which predates partial OpenBSD implementation from 2003).

[1] https://pax.grsecurity.net/docs/mprotect.txt
[2] https://undeadly.org/cgi?action=article;sid=20030417082752

> 
> Please don't conflate your assumed stronger semantics with the general
> principle. It not matching you expectations does not necessarily mean
> that it is wrong.
> 
> If you want a stronger W^X semantics, please refer to this specifically
> with a distinct name.
> 
>> a) This requires a remap operation. Two mappings point to the same
>>      physical page. One mapping has W and the other one has X. This
>>      is a violation of W^X.
>>
>> b) This is again a violation. The kernel should refuse to give execute
>>      permission to a page that was writeable in the past and refuse to
>>      give write permission to a page that was executable in the past.
>>
>> c) This is just a variation of (a).
> 
> As above, this is not true.
> 
> If you have a rationale for why this is desirable or necessary, please
> justify that before using this as justification for additional features.
> 
>> In general, the problem with user-level methods to map and execute
>> dynamic code is that the kernel cannot tell if a genuine application is
>> using them or an attacker is using them or piggy-backing on them.
> 
> Yes, and as I pointed out the same is true for trampfd unless you can
> somehow authenticate the calls are legitimate (in both callsite and the
> set of arguments), and I don't see any reasonable way of doing that.
> 
> If you relax your threat model to an attacker not being able to make
> arbitrary syscalls, then your suggestion that userspace can perorm
> chceks between syscalls may be sufficient, but as I pointed out that's
> equally true for a sealed memfd or similar.
> 
>> Off the top of my head, I have tried to identify some examples
>> where we can have more trust on dynamic code and have the kernel
>> permit its execution.
>>
>> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>>     is the code. There is no code generation involved. This is what I
>>     have presented in the patch series as the first cut.
> 
> This is sleight-of-hand; it doesn't matter where the logic is performed
> if the power is identical. Practically speaking this is equivalent to
> some dynamic code generation.
> 
> I think that it's misleading to say that because the kernel emulates
> something it is safe when the provenance of the syscall arguments cannot
> be verified.
> 
> [...]
> 
>> Anyway, these are just examples. The principle is - if we can identify
>> dynamic code that has a certain measure of trust, can the kernel
>> permit their execution?
> 
> My point generally is that the kernel cannot identify this, and if
> usrspace code is trusted to dynamically generate trampfd arguments it
> can equally be trusted to dyncamilly generate code.
> 
> [...]
> 
>> As I have mentioned above, I intend to have the kernel generate code
>> only if the code generation is simple enough. For more complicated cases,
>> I plan to use a user-level code generator that is for exclusive kernel use.
>> I have yet to work out the details on how this would work. Need time.
> 
> This reads to me like trampfd is only dealing with a few special cases
> and we know that we need a more general solution.
> 
> I hope I am mistaken, but I get the strong impression that you're trying
> to justify your existing solution rather than trying to understand the
> problem space.
> 
> To be clear, my strong opinion is that we should not be trying to do
> this sort of emulation or code generation within the kernel. I do think
> it's worthwhile to look at mechanisms to make it harder to subvert
> dynamic userspace code generation, but I think the code generation
> itself needs to live in userspace (e.g. for ABI reasons I previously
> mentioned).
> 
> Mark.
>
Mark Rutland Sept. 1, 2020, 3:42 p.m. UTC | #54
On Wed, Aug 19, 2020 at 08:53:42PM +0200, Mickaël Salaün wrote:
> On 12/08/2020 12:06, Mark Rutland wrote:
> > Contemporary W^X means that a given virtual alias cannot be writeable
> > and executeable simultaneously, permitting (a) and (b). If you read the
> > references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> > release notes and related presentation make this clear, and further they
> > expect (b) to occur with JITS flipping W/X with mprotect().
> 
> W^X (with "permanent" mprotect restrictions [1]) goes back to 2000 with
> PaX [2] (which predates partial OpenBSD implementation from 2003).
> 
> [1] https://pax.grsecurity.net/docs/mprotect.txt
> [2] https://undeadly.org/cgi?action=article;sid=20030417082752

Thanks for the pointers!

Mark.