Message ID | 20200728131050.24443-1-madvenka@linux.microsoft.com (mailing list archive) |
---|---|
Series | Implement Trampoline File Descriptor |
From: madvenka@linux.microsoft.com
> Sent: 28 July 2020 14:11
...
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

Isn't the performance of this going to be horrid?

If you don't care that much about performance the fixup can
all be done in userspace within the fault signal handler.

Since whatever you do needs the application changed why
not change the implementation of nested functions to not
need on-stack executable trampolines.

I can think of other alternatives that don't need much more
than an array of 'push constant; jump trampoline' instructions
be created (all jump to the same place).

You might want something to create an executable page of such
instructions.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 7/28/2020 6:10 AM, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ------------
>
> Trampolines are used in many different user applications. Trampoline
> code is often generated at runtime. Trampoline code can also just be a
> pre-defined sequence of machine instructions in a data buffer.
>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
>
> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
>
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.
>
> LSMs (such as the IPE proposal [1]) may allow only properly signed object
> files to be mapped with execute permissions. This will prevent trampoline
> files from being mapped. Again, exceptions have to be made for genuine
> applications.
>
> We need a way to execute trampolines without making security exceptions
> where possible and to reduce the attack surface even further.
>
> Examples of trampolines
> -----------------------
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
> - Python
> - Java
> - Javascript
> - Ruby FFI
> - Lisp
> - Objective C
>
> GCC nested functions:
>
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
>
> Currently available solution
> ----------------------------
>
> One solution that has been proposed to allow trampolines to be executed
> without making security exceptions is Trampoline Emulation. See:
>
> https://pax.grsecurity.net/docs/emutramp.txt
>
> In this solution, the kernel recognizes certain sequences of instructions
> as "well-known" trampolines. When such a trampoline is executed, a page
> fault happens because the trampoline page does not have execute permission.
> The kernel recognizes the trampoline and emulates it. Basically, the
> kernel does the work of the trampoline on behalf of the application.

What prevents a malicious process from using the "well-known" trampoline
to its own purposes? I expect it is obvious, but I'm not seeing it. Old
eyes, I suppose.

> Here, the attack surface is the buffer that contains the trampoline.
> The attack surface is narrower than before. A hacker may still be able to
> modify what gets loaded in the registers or modify the target PC to point
> to arbitrary locations.
>
> Currently, the emulated trampolines are the ones used in libffi and GCC
> nested functions. To my knowledge, only X86 is supported at this time.
>
> As noted in emutramp.txt, this is not a generic solution. For every new
> trampoline that needs to be supported, new instruction sequences need to
> be recognized by the kernel and emulated. And this has to be done for
> every architecture that needs to be supported.
>
> emutramp.txt notes the following:
>
> "... the real solution is not in emulation but by designing a kernel API
> for runtime code generation and modifying userland to make use of it."
>
> Trampoline File Descriptor (trampfd)
> --------------------------
>
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.
> The API is described in patch 1/4 of this patchset. I provide a
> summary here:
>
> Trampolines commonly execute the following sequence:
>
> - Load some values in some registers and/or
> - Push some values on the stack
> - Jump to a target PC
>
> libffi and GCC nested function trampolines fit into this model.
>
> Using the kernel API, applications and libraries can:
>
> - Create a trampoline object
> - Associate a register context with the trampoline (including
>   a target PC)
> - Associate a stack context with the trampoline
> - Map the trampoline into a process address space
> - Execute the trampoline by executing at the trampoline address
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.
>
> As for the target PC, trampfd implements a measure called the
> "Allowed PCs" context (see Advantages) to prevent a hacker from making
> the target PC point to arbitrary locations. So, the attack surface is
> narrower than Trampoline Emulation.
>
> Advantages of the Trampoline File Descriptor approach
> -----------------------------------------------------
>
> - trampfd is customizable. The user can specify any combination of
>   allowed register name-value pairs in the register context and the kernel
>   will set it up accordingly. This allows different user trampolines to be
>   converted to use trampfd.
>
> - trampfd allows a stack context to be set up so that trampolines that
>   need to push values on the user stack can do that.
>
> - The initial work is targeted for X86 and ARM. But the implementation
>   leverages small portions of existing signal delivery code. Specifically,
>   it uses pt_regs for setting up user registers and copy_to_user()
>   to push values on the stack. So, this can be very easily ported to other
>   architectures.
>
> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.
>
> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.
>
> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
>
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.
>
> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
>
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.
>
> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.
>
> - Trampfd can be used for other purposes to extend the kernel's
>   functionality.
>
> libffi
> ------
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.txt
>
> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.
>
> Work that is pending
> --------------------
>
> - I am working on implementing an SELinux setting called "exectramp"
>   similar to "execmem" to allow the use of trampfd on a per application
>   basis.

You could make a separate LSM to do these checks instead of limiting
it to SELinux. Your use case, your call, of course.

>
> - I have a comprehensive test program to test the kernel API. I am
>   working on adding it to selftests.
>
> References
> ----------
>
> [1] https://microsoft.github.io/ipe/
> ---
> Madhavan T. Venkataraman (4):
>   fs/trampfd: Implement the trampoline file descriptor API
>   x86/trampfd: Support for the trampoline file descriptor
>   arm64/trampfd: Support for the trampoline file descriptor
>   arm/trampfd: Support for the trampoline file descriptor
>
>  arch/arm/include/uapi/asm/ptrace.h | 20 ++
>  arch/arm/kernel/Makefile | 1 +
>  arch/arm/kernel/trampfd.c | 214 +++++++++++++++++
>  arch/arm/mm/fault.c | 12 +-
>  arch/arm/tools/syscall.tbl | 1 +
>  arch/arm64/include/asm/ptrace.h | 9 +
>  arch/arm64/include/asm/unistd.h | 2 +-
>  arch/arm64/include/asm/unistd32.h | 2 +
>  arch/arm64/include/uapi/asm/ptrace.h | 57 +++++
>  arch/arm64/kernel/Makefile | 2 +
>  arch/arm64/kernel/trampfd.c | 278 ++++++++++++++++++++++
>  arch/arm64/mm/fault.c | 15 +-
>  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  arch/x86/include/uapi/asm/ptrace.h | 38 +++
>  arch/x86/kernel/Makefile | 2 +
>  arch/x86/kernel/trampfd.c | 313 +++++++++++++++++++++++++
>  arch/x86/mm/fault.c | 11 +
>  fs/Makefile | 1 +
>  fs/trampfd/Makefile | 6 +
>  fs/trampfd/trampfd_data.c | 43 ++++
>  fs/trampfd/trampfd_fops.c | 131 +++++++++++
>  fs/trampfd/trampfd_map.c | 78 ++++++
>  fs/trampfd/trampfd_pcs.c | 95 ++++++++
>  fs/trampfd/trampfd_regs.c | 137 +++++++++++
>  fs/trampfd/trampfd_stack.c | 131 +++++++++++
>  fs/trampfd/trampfd_stubs.c | 41 ++++
>  fs/trampfd/trampfd_syscall.c | 92 ++++++++
>  include/linux/syscalls.h | 3 +
>  include/linux/trampfd.h | 82 +++++++
>  include/uapi/asm-generic/unistd.h | 4 +-
>  include/uapi/linux/trampfd.h | 171 ++++++++++++++
>  init/Kconfig | 8 +
>  kernel/sys_ni.c | 3 +
>  34 files changed, 1998 insertions(+), 7 deletions(-)
>  create mode 100644 arch/arm/kernel/trampfd.c
>  create mode 100644 arch/arm64/kernel/trampfd.c
>  create mode 100644 arch/x86/kernel/trampfd.c
>  create mode 100644 fs/trampfd/Makefile
>  create mode 100644 fs/trampfd/trampfd_data.c
>  create mode 100644 fs/trampfd/trampfd_fops.c
>  create mode 100644 fs/trampfd/trampfd_map.c
>  create mode 100644 fs/trampfd/trampfd_pcs.c
>  create mode 100644 fs/trampfd/trampfd_regs.c
>  create mode 100644 fs/trampfd/trampfd_stack.c
>  create mode 100644 fs/trampfd/trampfd_stubs.c
>  create mode 100644 fs/trampfd/trampfd_syscall.c
>  create mode 100644 include/linux/trampfd.h
>  create mode 100644 include/uapi/linux/trampfd.h
>
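For readers who want a concrete picture of the model the quoted cover letter describes (load some registers, push some stack values, jump to a target PC that must be on an allowed list), here is a minimal user-space sketch in C. It only simulates the effect on a toy register file. Every name in it (`toy_cpu`, `tramp_ctx`, `pc_allowed`, `tramp_invoke`) and both size constants are assumptions made for illustration; none of it is the trampfd API from patch 1/4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NREGS   4   /* toy register file size (assumption) */
#define TOY_STACK  16   /* toy stack depth in words (assumption) */

/* Toy user state: registers, a downward-growing stack, and a PC. */
struct toy_cpu {
    uint64_t regs[TOY_NREGS];
    uint64_t stack[TOY_STACK];
    size_t   sp;             /* index of the current top; TOY_STACK when empty */
    uint64_t pc;
};

/* Hypothetical trampoline context: what to load, what to push, where to go. */
struct tramp_ctx {
    struct { unsigned reg; uint64_t value; } reg_ctx[TOY_NREGS];
    size_t          nregs;
    uint64_t        stack_ctx[TOY_STACK];
    size_t          nstack;
    uint64_t        target_pc;
    const uint64_t *allowed_pcs;   /* read-only list, e.g. ABI handlers */
    size_t          nallowed;
};

/* Return 1 if pc appears in the allowed list, 0 otherwise. */
static int pc_allowed(const struct tramp_ctx *t, uint64_t pc)
{
    for (size_t i = 0; i < t->nallowed; i++)
        if (t->allowed_pcs[i] == pc)
            return 1;
    return 0;
}

/*
 * Simulate one trampoline invocation. Returns 0 on success and -1 if the
 * context is rejected; on rejection the toy CPU is left untouched.
 */
static int tramp_invoke(const struct tramp_ctx *t, struct toy_cpu *cpu)
{
    assert(cpu->sp <= TOY_STACK);

    /* Validate everything first so a bad context changes nothing. */
    if (!pc_allowed(t, t->target_pc))
        return -1;
    if (t->nregs > TOY_NREGS || t->nstack > cpu->sp)
        return -1;
    for (size_t i = 0; i < t->nregs; i++)
        if (t->reg_ctx[i].reg >= TOY_NREGS)
            return -1;

    for (size_t i = 0; i < t->nregs; i++)          /* register context */
        cpu->regs[t->reg_ctx[i].reg] = t->reg_ctx[i].value;
    for (size_t i = 0; i < t->nstack; i++)         /* stack context */
        cpu->stack[--cpu->sp] = t->stack_ctx[i];
    cpu->pc = t->target_pc;                        /* jump to target */
    return 0;
}
```

The allowed-PC check runs before any state is modified, which mirrors the cover letter's point that a tampered target PC should be refused rather than followed.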
Thanks. See inline..

On 7/28/20 10:13 AM, David Laight wrote:
> From: madvenka@linux.microsoft.com
>> Sent: 28 July 2020 14:11
> ...
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> Isn't the performance of this going to be horrid?

It takes about the same amount of time as getpid(). So, it is
one quick trip into the kernel. I expect that applications will
typically not care about this extra overhead as long as
they are able to run.

But I agree that if there is an application that cannot tolerate
this extra overhead, then it is an issue. See below for further
discussion.

In the libffi changes I have included in the cover letter, I have
done it in such a way that trampfd is chosen when current security
settings don't allow other methods such as loading trampoline code
into a file and mapping it. In this case, the application can at
least run with trampfd.

>
> If you don't care that much about performance the fixup can
> all be done in userspace within the fault signal handler.

I do care about performance.

This is a framework to address trampolines. In this initial work,
I want to establish one basic way for things to work. In the future,
trampfd can be enhanced for performance. For instance, it is easy
for an architecture to generate the exact instructions required to
load specified registers, push specified values on the stack and
jump to a target PC. The kernel can map a page with the generated
code with execute permissions. In this case, the performance issue
goes away.

> Since whatever you do needs the application changed why
> not change the implementation of nested functions to not
> need on-stack executable trampolines.

I kinda agree with your suggestion.

But it is up to the GCC folks to change its implementation.
I am trying to provide a way for their existing implementation to
work in a more secure way.

> I can think of other alternatives that don't need much more
> than an array of 'push constant; jump trampoline' instructions
> be created (all jump to the same place).
>
> You might want something to create an executable page of such
> instructions.

Agreed. And that can be done within this framework as I have
mentioned above.

But it is not just this trampoline type that I have implemented in
this patchset. In the future, other types can be implemented and
other contexts can be defined. Basically, the approach is for the
user to supply a recipe to the kernel and leave it up to the kernel
to do it in the best way possible. I am hoping that other forms of
dynamic code can be addressed in the future using the same framework.

*Purely as a hypothetical example*, a user can supply instructions
in a language such as BPF that the kernel understands and have the
kernel arrange for that to be executed in user context.

Madhavan
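The "array of 'push constant; jump trampoline' instructions" that David mentions, and that Madhavan says a future trampfd could generate, can be pictured with a short C sketch that only fills a byte buffer with x86-64 encodings (`68 imm32` for `push imm32`, `E9 rel32` for `jmp rel32`). The helper names `put_le32` and `emit_stubs`, the 10-byte stub layout, and the idea of passing in the future mapping address as `base` are assumptions for illustration. The sketch does not map or execute anything, since how such a page becomes executable is exactly the policy question discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

#define STUB_SIZE 10   /* 5 bytes push imm32 + 5 bytes jmp rel32 */

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/*
 * Fill buf with n stubs, assuming buf will later appear at address 'base'
 * and that all stubs jump to one common handler at address 'handler'.
 * Stub i pushes the constant i so the handler can tell the stubs apart.
 * Returns 0 on success, -1 if the buffer is too small, n does not fit a
 * positive imm32, or a jump is out of rel32 range (buf may then be
 * partially written).
 */
static int emit_stubs(uint8_t *buf, size_t buflen, uint64_t base,
                      uint64_t handler, uint32_t n)
{
    if (n > (uint32_t)INT32_MAX || (uint64_t)n * STUB_SIZE > buflen)
        return -1;

    for (uint32_t i = 0; i < n; i++) {
        uint8_t *s = buf + (size_t)i * STUB_SIZE;
        /* rel32 is relative to the address of the byte after the jmp. */
        uint64_t next = base + (uint64_t)i * STUB_SIZE + STUB_SIZE;
        int64_t rel = (int64_t)(handler - next);

        if (rel < INT32_MIN || rel > INT32_MAX)
            return -1;
        s[0] = 0x68;                          /* push imm32 */
        put_le32(s + 1, i);
        s[5] = 0xE9;                          /* jmp rel32 */
        put_le32(s + 6, (uint32_t)(int32_t)rel);
    }
    return 0;
}
```

Because every stub differs only in its pushed constant and its relative displacement, the page contents are fully determined by `base`, `handler` and `n`, which is what makes it plausible for a trusted component to generate them.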
Thanks.

On 7/28/20 11:05 AM, Casey Schaufler wrote:
>> In this solution, the kernel recognizes certain sequences of instructions
>> as "well-known" trampolines. When such a trampoline is executed, a page
>> fault happens because the trampoline page does not have execute permission.
>> The kernel recognizes the trampoline and emulates it. Basically, the
>> kernel does the work of the trampoline on behalf of the application.
> What prevents a malicious process from using the "well-known" trampoline
> to its own purposes? I expect it is obvious, but I'm not seeing it. Old
> eyes, I suppose.

You are quite right. As I note below, the attack surface is the
buffer that contains the trampoline code. Since the kernel does
check the instruction sequence, the sequence cannot be changed by
a hacker. But the hacker can presumably change the register values
and redirect the PC to his desired location.

The assumption with trampoline emulation is that the system will
have security settings that will prevent pages from having both
write and execute permissions. So, a hacker cannot load his own
code in a page and redirect the PC to it and execute his own code.
But he can probably set the PC to point to arbitrary locations.
For instance, jump to the middle of a C library function.

>
>> Here, the attack surface is the buffer that contains the trampoline.
>> The attack surface is narrower than before. A hacker may still be able to
>> modify what gets loaded in the registers or modify the target PC to point
>> to arbitrary locations.
...
>> Work that is pending
>> --------------------
>>
>> - I am working on implementing an SELinux setting called "exectramp"
>> similar to "execmem" to allow the use of trampfd on a per application
>> basis.
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.

OK. I will research this.

Madhavan
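The point made in this reply, that an emulator can pin down the instruction sequence but not the values embedded in it, can be illustrated with a small matcher in C. The byte pattern used here (`49 BB imm64`, `49 BA imm64`, `49 FF E3`, i.e. `movabs $target,%r11; movabs $chain,%r10; jmp *%r11`) is an assumption chosen to resemble an x86-64 nested-function trampoline; it is not taken from the PaX emulation code, and the names `tramp_fields`, `get_le64` and `match_trampoline` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMP_LEN 23   /* 10 + 10 + 3 bytes for the assumed pattern */

/* The attacker-influenced parts of the trampoline: the two immediates. */
struct tramp_fields {
    uint64_t target;   /* value loaded into %r11 and jumped to */
    uint64_t chain;    /* value loaded into %r10 */
};

/* Read a little-endian 64-bit value from p. */
static uint64_t get_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Return 1 and fill *out if code has the fixed opcode bytes of the assumed
 * trampoline; return 0 (leaving *out untouched) otherwise. Only the opcode
 * bytes are checked; the immediates are accepted whatever they contain.
 */
static int match_trampoline(const uint8_t *code, size_t len,
                            struct tramp_fields *out)
{
    assert(out != NULL);
    if (len < TRAMP_LEN)
        return 0;
    if (code[0] != 0x49 || code[1] != 0xBB)      /* movabs $imm64, %r11 */
        return 0;
    if (code[10] != 0x49 || code[11] != 0xBA)    /* movabs $imm64, %r10 */
        return 0;
    if (code[20] != 0x49 || code[21] != 0xFF || code[22] != 0xE3)
        return 0;                                /* jmp *%r11 */
    out->target = get_le64(code + 2);
    out->chain = get_le64(code + 12);
    return 1;
}
```

Overwriting the eight target bytes in the buffer still yields a match with a different `target`, which is the residual attack surface described above; a check such as trampfd's "Allowed PCs" would have to be applied to `out->target` separately.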
On Tue, 28 Jul 2020, Casey Schaufler wrote:

> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.

It's not limited to SELinux. This is hooked via the LSM API and
implementable by any LSM (similar to execmem, execstack etc.)
On 7/28/20 12:05 PM, James Morris wrote:
> On Tue, 28 Jul 2020, Casey Schaufler wrote:
>
>> You could make a separate LSM to do these checks instead of limiting
>> it to SELinux. Your use case, your call, of course.
> It's not limited to SELinux. This is hooked via the LSM API and
> implementable by any LSM (similar to execmem, execstack etc.)

Yes. I have an implementation that I am testing right now that
defines the hook for exectramp and implements it for
SELinux. That is why I mentioned SELinux.

Madhavan
On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
> > From: madvenka@linux.microsoft.com
> >> Sent: 28 July 2020 14:11
> > ...
> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> > Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.

What did you test this on? A page fault on any modern x86_64 system
is much, much, much, much slower than a syscall.

--Andy
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

This is quite clever, but now I’m wondering just how much kernel help
is really needed. In your series, the trampoline is a non-executable
page. I can think of at least three alternative approaches, and I'd
like to know the pros and cons.

1. Entirely userspace: a return trampoline would be something like:

1:
    pushq %rax
    pushq %rbx
    pushq %rcx
    ...
    pushq %r15
    movq %rsp, %rdi        # pointer to saved regs
    leaq 1b(%rip), %rsi    # pointer to the trampoline itself
    callq trampoline_handler    # see below

You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed. The
'callq trampoline_handler' part would need to be a bit clever to make
it continue to work despite this remapping. This will be *much*
faster than trampfd. How much of your use case would it cover? For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.

2. Use existing kernel functionality. Raise a signal, modify the
state, and return from the signal. This is very flexible and may not
be all that much slower than trampfd.

3. Use a syscall. Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:

    pushq %rax
    movq $__NR_magic_trampoline, %rax
    syscall

with some adjustment if the stack slot you're clobbering is important.


Also, will using trampfd cause issues with various unwinders? I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.

All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code. This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc. A cleverly designed JIT interface could function without
serialization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games. At its very simplest, this could be:

    void *jit_create_code(const void *source, size_t len);

and the result would be a new anonymous mapping that contains exactly
the code requested. There could also be:

    int jittfd_create(...);

that does something similar but creates a memfd. A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region. On x86, an appendable JIT region would
start filled with 0xCC, and I bet there's a way to materialize new
code into a previously 0xcc-filled virtual page without any
synchronization. One approach would be to start with:

    <some code>
    0xcc
    0xcc
    ...
    0xcc

and to create a whole new page like:

    <some code>
    <some more code>
    0xcc
    ...
    0xcc

so that the only difference is that some code changed to some more
code. Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs. This may not work if <some more code> spans a page
boundary. The #BP fixup would zap the TLB and retry. Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that cross I$ fetch
boundaries. I'm not sure to what extent I$ snooping helps.

--Andy
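The appendable JIT region Andy outlines (start with a page of 0xCC, build a complete new page in which some 0xCC bytes have become code, then switch over to it) can be modeled in plain C without touching page tables. In this sketch a heap buffer stands in for the physical page and a pointer assignment stands in for the PTE replacement; `jit_region`, `jit_region_init` and `jit_region_append` are hypothetical names, and nothing here is executable memory or a proposed kernel interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define JIT_PAGE 4096

/* Toy JIT region: 'page' stands in for the physical page behind one PTE. */
struct jit_region {
    uint8_t *page;   /* current backing page */
    size_t   used;   /* bytes of real code; the rest is 0xCC (int3) */
};

/* Start with a page that contains nothing but breakpoint bytes. */
static int jit_region_init(struct jit_region *r)
{
    r->page = malloc(JIT_PAGE);
    if (!r->page)
        return -1;
    memset(r->page, 0xCC, JIT_PAGE);
    r->used = 0;
    return 0;
}

/*
 * Append code by building a complete new page and then switching the
 * pointer, which models replacing the PTE. Returns the offset of the new
 * code, or -1 if it does not fit in the page or allocation fails.
 */
static long jit_region_append(struct jit_region *r, const void *code,
                              size_t len)
{
    if (len > JIT_PAGE - r->used)
        return -1;                       /* would cross the page boundary */

    uint8_t *fresh = malloc(JIT_PAGE);
    if (!fresh)
        return -1;
    memcpy(fresh, r->page, JIT_PAGE);    /* old code plus 0xCC filler */
    memcpy(fresh + r->used, code, len);  /* only former 0xCC bytes change */

    uint8_t *old = r->page;
    r->page = fresh;                     /* the modeled "PTE swap" */
    free(old);   /* a kernel would defer this until no TLB maps the old page */

    long off = (long)r->used;
    r->used += len;
    return off;
}
```

The property the model preserves is the one the message relies on: at every moment a reader sees either the old page or the new page, and the two differ only where the old page held 0xCC.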
On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> Thanks. See inline..
>>
>> On 7/28/20 10:13 AM, David Laight wrote:
>>> From: madvenka@linux.microsoft.com
>>>> Sent: 28 July 2020 14:11
>>> ...
>>>> The kernel creates the trampoline mapping without any permissions. When
>>>> the trampoline is executed by user code, a page fault happens and the
>>>> kernel gets control. The kernel recognizes that this is a trampoline
>>>> invocation. It sets up the user registers based on the specified
>>>> register context, and/or pushes values on the user stack based on the
>>>> specified stack context, and sets the user PC to the requested target
>>>> PC. When the kernel returns, execution continues at the target PC.
>>>> So, the kernel does the work of the trampoline on behalf of the
>>>> application.
>>> Isn't the performance of this going to be horrid?
>> It takes about the same amount of time as getpid(). So, it is
>> one quick trip into the kernel. I expect that applications will
>> typically not care about this extra overhead as long as
>> they are able to run.
> What did you test this on? A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.

I sent a response to this. But the mail was returned to me.
I am resending.

I tested it on a KVM guest running Ubuntu. So, when you say
that a page fault is much slower, do you mean a regular page
fault that is handled through the VM layer? Here is the relevant code
in do_user_addr_fault():

	if (unlikely(access_error(hw_error_code, vma))) {
		/*
		 * If it is a user execute fault, it could be a trampoline
		 * invocation.
		 */
		if ((hw_error_code & tflags) == tflags &&
		    trampfd_fault(vma, regs)) {
			up_read(&mm->mmap_sem);
			return;
		}
		bad_area_access_error(regs, hw_error_code, address, vma);
		return;
	}
	...
	fault = handle_mm_fault(vma, address, flags);

trampfd faults are instruction faults that go through a different code path than
the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
is time consuming. Could you clarify?

Thanks.

Madhavan
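The test `(hw_error_code & tflags) == tflags` in the excerpt above requires every bit of `tflags` to be set in the hardware error code. The excerpt does not show how `tflags` is defined, so the sketch below assumes it is the combination of the "user mode" and "instruction fetch" bits, which matches the comment about a user execute fault. The bit positions follow the x86 page-fault error code layout (the kernel's `X86_PF_*` constants); the `PF_*` names, `TFLAGS` and `is_user_exec_fault` are local to this illustration.

```c
#include <stdint.h>

/* x86 page-fault error code bits, mirroring the kernel's X86_PF_* values. */
enum {
    PF_PROT  = 1u << 0,   /* 0: page not present, 1: protection violation */
    PF_WRITE = 1u << 1,   /* access was a write */
    PF_USER  = 1u << 2,   /* access came from user mode */
    PF_RSVD  = 1u << 3,   /* reserved bit set in a paging entry */
    PF_INSTR = 1u << 4,   /* access was an instruction fetch */
};

/* Assumed value of tflags; the quoted excerpt does not show its definition. */
#define TFLAGS (PF_USER | PF_INSTR)

/* Return 1 only when all bits in TFLAGS are set in the error code. */
static int is_user_exec_fault(unsigned long hw_error_code)
{
    return (hw_error_code & TFLAGS) == TFLAGS;
}
```

Under that assumption, an error code of 0x15 (protection violation, user mode, instruction fetch) passes the test, while 0x04 (a user-mode read) and 0x10 (a kernel-mode instruction fetch) do not.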
I am working on a response to this. I will send it soon.

Thanks.

Madhavan

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed.
> [...]
On Tue, Jul 28, 2020 at 10:40 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> > On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> > <madvenka@linux.microsoft.com> wrote:
> >> Thanks. See inline..
> >>
> >> On 7/28/20 10:13 AM, David Laight wrote:
> >>> From: madvenka@linux.microsoft.com
> >>>> Sent: 28 July 2020 14:11
> >>> ...
> >>>> The kernel creates the trampoline mapping without any permissions. When
> >>>> the trampoline is executed by user code, a page fault happens and the
> >>>> kernel gets control. The kernel recognizes that this is a trampoline
> >>>> invocation. It sets up the user registers based on the specified
> >>>> register context, and/or pushes values on the user stack based on the
> >>>> specified stack context, and sets the user PC to the requested target
> >>>> PC. When the kernel returns, execution continues at the target PC.
> >>>> So, the kernel does the work of the trampoline on behalf of the
> >>>> application.
> >>> Isn't the performance of this going to be horrid?
> >> It takes about the same amount of time as getpid(). So, it is
> >> one quick trip into the kernel. I expect that applications will
> >> typically not care about this extra overhead as long as
> >> they are able to run.
> > What did you test this on? A page fault on any modern x86_64 system
> > is much, much, much, much slower than a syscall.
>
> I tested it on a KVM guest running Ubuntu. So, when you say
> that a page fault is much slower, do you mean a regular page
> fault that is handled through the VM layer? Here is the relevant code
> in do_user_addr_fault():

I mean that x86 CPUs have reasonably fast SYSCALL and SYSRET
instructions (the former is used for 64-bit system calls on Linux and
the latter is mostly used to return from system calls), but hardware
page fault delivery and IRET (used to return from page faults) are
very slow.
From: Madhavan T. Venkataraman > Sent: 28 July 2020 19:52 ... > trampfd faults are instruction faults that go through a different code path than > the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that > is time consuming. Could you clarify? Given that the expectation is a few instructions in userspace (e.g. to pick up the original arguments for a nested call), the (probable) thousands of clocks taken by entering the kernel (especially with page table separation) is a massive delta. If entering the kernel were cheap, no one would have added the vDSO functions for getting the time of day. David
* Andy Lutomirski: > This is quite clever, but now I’m wondering just how much kernel help > is really needed. In your series, the trampoline is a non-executable > page. I can think of at least two alternative approaches, and I'd > like to know the pros and cons. > > 1. Entirely userspace: a return trampoline would be something like: > > 1: > pushq %rax > pushq %rbx > pushq %rcx > ... > pushq %r15 > movq %rsp, %rdi # pointer to saved regs > leaq 1b(%rip), %rsi # pointer to the trampoline itself > callq trampoline_handler # see below > > You would fill a page with a bunch of these, possibly compacted to get > more per page, and then you would remap as many copies as needed. libffi does something like this for iOS, I believe. The only thing you really need is a PC-relative indirect call, with the target address loaded from a different page. The trampoline handler can do all the rest because it can identify the trampoline from the stack. Having a closure parameter loaded into a register will speed things up, of course. I still hope to transition libffi to this model for most Linux targets. It really simplifies things because you don't have to deal with cache flushes (on both the data and code aliases for SELinux support). But the key observation is that efficient trampolines do not need run-time code generation at all because their code is so regular. Thanks, Florian
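The layout Florian describes (a fixed code page whose stubs load their target from a different page) comes down to simple address arithmetic. A minimal sketch in C, assuming a 4 KiB page, a 16-byte stub, and a data page mapped immediately after the code page. `PAGE_SIZE`, `SLOT_STRIDE`, `struct tramp_data` and `tramp_data_addr()` are hypothetical names chosen for illustration; they are not libffi's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE      4096u  /* assumed base page size */
#define SLOT_STRIDE    16u    /* assumed size of one stub in the code page */
#define SLOTS_PER_PAGE (PAGE_SIZE / SLOT_STRIDE)

/* Per-trampoline data kept in the writable, non-executable data page. */
struct tramp_data {
    void *target;   /* where the stub should jump */
    void *closure;  /* per-instance argument for the handler */
};

/* One data slot per stub must fit in the single data page. */
_Static_assert(SLOTS_PER_PAGE * sizeof(struct tramp_data) <= PAGE_SIZE,
               "data slots must fit in one page");

/*
 * Given any address inside a stub, return the address of the matching
 * data slot in the page that follows the code page. This is pure
 * integer arithmetic; nothing is dereferenced.
 */
uintptr_t tramp_data_addr(uintptr_t stub_addr)
{
    uintptr_t code_page = stub_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    size_t index = (stub_addr - code_page) / SLOT_STRIDE;

    assert(index < SLOTS_PER_PAGE);
    return code_page + PAGE_SIZE + index * sizeof(struct tramp_data);
}
```

Because the offset from a stub to its data slot is a build-time constant in this layout, the stub itself can be identical for every instance, which is why no run-time code generation is needed.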
On 7/29/20 3:36 AM, David Laight wrote: > From: Madhavan T. Venkataraman >> Sent: 28 July 2020 19:52 > ... >> trampfd faults are instruction faults that go through a different code path than >> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that >> is time consuming. Could you clarify? > Given that the expectation is a few instructions in userspace > (eg to pick up the original arguments for a nested call) > the (probable) thousands of clocks taken by entering the > kernel (especially with page table separation) is a massive > delta. > > If entering the kernel were cheap no one would have added > the vDSO functions for getting the time of day. I hear you. BTW, I did not say that the overhead was trivial. I only said that in most cases, applications may not mind that extra overhead. However, since multiple people have raised that as an issue, I will address it. I mentioned before that the kernel can actually supply the code page that sets the context and jumps to a PC and map it so the performance issue can be addressed. I was planning to do that as a future enhancement. If there is a consensus that I must address it immediately, I could do that. I will continue this discussion in my reply to Andy's email. Let us pick it up from there. Thanks. Madhavan > > David
> This is quite clever, but now I’m wondering just how much kernel help > is really needed. In your series, the trampoline is a non-executable > page. I can think of at least two alternative approaches, and I'd > like to know the pros and cons. > > 1. Entirely userspace: a return trampoline would be something like: > > 1: > pushq %rax > pushq %rbx > pushq %rcx > ... > pushq %r15 > movq %rsp, %rdi # pointer to saved regs > leaq 1b(%rip), %rsi # pointer to the trampoline itself > callq trampoline_handler # see below For nested calls (where the trampoline needs to pass the original stack frame to the nested function) I think you just need a page full of: mov $0, scratch_reg; jmp trampoline_handler mov $1, scratch_reg; jmp trampoline_handler You need an unused register, on x86-64 I think both r10 and r11 are available. On i386 I think eax can be used. It might even be that the first argument register is available - if that is used to pass in the stack frame. The trampoline_handler then uses the passed in value to index an array of stack frame and function pointers and jumps to the real function. You need to hold everything in __thread data. And maybe be able to allocate an extra page for deeply nested code paths (eg recursive nested functions). You might then need a driver to create you a suitable executable page. Somehow you need to pass in the address of the trampoline_handler and the number for the first fault. It needs to pass back the 'stride' of the array and number of elements created. But if you can take the cost of the page fault, then you can interpret the existing trampoline in userspace within the signal handler. This is two kernel entry/exits. Arbitrary JIT is a different problem entirely. David
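The table that David's trampoline_handler would index can be modelled in plain C. This is only a sketch of the data structure: the real handler would be a few instructions of assembly that receive the index in a scratch register and jump rather than call. `MAX_TRAMPS`, `struct tramp_slot`, `tramp_push()`, `tramp_pop()`, `tramp_dispatch()` and `add_base()` are hypothetical names, and `__thread` is the GCC/Clang thread-local extension mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRAMPS 64  /* hypothetical per-thread table size */

/* A nested function receives its enclosing frame explicitly. */
typedef long (*nested_fn)(void *frame, long arg);

struct tramp_slot {
    nested_fn fn;   /* real function to run */
    void *frame;    /* enclosing stack frame (static chain) */
};

/* Per-thread table, so threads never see each other's frames. */
static __thread struct tramp_slot tramp_table[MAX_TRAMPS];
static __thread size_t tramp_used;

/* Claim the next slot; the returned index is the constant that
 * stub N would load into the scratch register. */
int tramp_push(nested_fn fn, void *frame)
{
    if (tramp_used == MAX_TRAMPS)
        return -1;
    tramp_table[tramp_used].fn = fn;
    tramp_table[tramp_used].frame = frame;
    return (int)tramp_used++;
}

/* Release the most recent slot when the enclosing function returns. */
void tramp_pop(void)
{
    assert(tramp_used > 0);
    tramp_used--;
}

/* Stand-in for trampoline_handler: index the table and call through. */
long tramp_dispatch(int index, long arg)
{
    assert(index >= 0 && (size_t)index < tramp_used);
    return tramp_table[index].fn(tramp_table[index].frame, arg);
}

/* Example nested function body: adds a variable from the enclosing frame. */
long add_base(void *frame, long arg)
{
    return *(long *)frame + arg;
}
```

The stack-like push/pop matches the lifetime of nested functions, whose trampolines die with the enclosing frame; a fixed table size is the reason the message mentions allocating an extra page for deep recursion.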
For some reason my email program is not delivering to all the recipients because of some formatting issues. I am resending. I apologize. I will try to get this fixed. Sorry for the delay. I just needed to think about it a little. I will respond to your first suggestion in this email. I will respond to the others in separate emails if that is alright with you. On 7/28/20 12:31 PM, Andy Lutomirski wrote: >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote: >> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> >> >> The kernel creates the trampoline mapping without any permissions. When >> the trampoline is executed by user code, a page fault happens and the >> kernel gets control. The kernel recognizes that this is a trampoline >> invocation. It sets up the user registers based on the specified >> register context, and/or pushes values on the user stack based on the >> specified stack context, and sets the user PC to the requested target >> PC. When the kernel returns, execution continues at the target PC. >> So, the kernel does the work of the trampoline on behalf of the >> application. > This is quite clever, but now I’m wondering just how much kernel help > is really needed. In your series, the trampoline is a non-executable > page. I can think of at least two alternative approaches, and I'd > like to know the pros and cons. > > 1. Entirely userspace: a return trampoline would be something like: > > 1: > pushq %rax > pushq %rbx > pushq %rcx > ... > pushq %r15 > movq %rsp, %rdi # pointer to saved regs > leaq 1b(%rip), %rsi # pointer to the trampoline itself > callq trampoline_handler # see below > > You would fill a page with a bunch of these, possibly compacted to get > more per page, and then you would remap as many copies as needed. The > 'callq trampoline_handler' part would need to be a bit clever to make > it continue to work despite this remapping. This will be *much* > faster than trampfd.
How much of your use case would it cover? For > the inverse, it's not too hard to write a bit of asm to set all > registers and jump somewhere. Let me state my understanding of what you are suggesting. Correct me if I get anything wrong. If you don't mind, I will also take the liberty of generalizing and paraphrasing your suggestion. The goal is to create two page mappings that are adjacent to each other: - a code page that contains template code for a trampoline. Since the template code would tend to be small in size, pack as many of them as possible within a page to conserve memory. In other words, create an array of the template code fragments. Each element in the array would be used for one trampoline instance. - a data page that contains an array of data elements. Corresponding to each code element in the code page, there would be a data element in the data page that would contain data that is specific to a trampoline instance. - Code will access data using PC-relative addressing. The management of the code pages and allocation for each trampoline instance would all be done in user space. Is this the general idea? Creating a code page ---------------------------- We can do this in one of the following ways: - Allocate a writable page at run time, write the template code into the page and have execute permissions on the page. - Allocate a writable page at run time, write the template code into the page and remap the page with just execute permissions. - Allocate a writable page at run time, write the template code into the page, write the page into a temporary file and map the file with execute permissions. - Include the template code in a code page at build time itself and just remap the code page each time you need a code page. Pros and Cons ------------------- As long as the OS provides the functionality to do this and the security subsystem in the OS allows the actions, this is totally feasible. If not, we need something like trampfd. 
As Florian mentioned, libffi does implement something like this for MACH. In fact, in my libffi changes, I use trampfd only after all the other methods have failed because of security settings. But the above approach only solves the problem for this simple type of trampoline. It does not provide a framework for addressing more complex types or even other forms of dynamic code. Also, each application would need to implement this solution for itself as opposed to relying on one implementation provided by the kernel. Trampfd-based solution ------------------------------- I outlined an enhancement to trampfd in a response to David Laight. In this enhancement, the kernel is the one that would set up the code page. The kernel would call an arch-specific support function to generate the code required to load registers, push values on the stack and jump to a PC for a trampoline instance based on its current context. The trampoline instance data could be baked into the code. My initial idea was to only have one trampoline instance per page. But I think I can implement multiple instances per page. I just have to manage the trampfd file private data and VMA private data accordingly to map an element in a code page to its trampoline object. The two approaches are similar except for the detail about who sets up and manages the trampoline pages. In both approaches, the performance problem is addressed. But trampfd can be used even when security settings are restrictive. Is my solution acceptable? A couple of things ------------------------ - In the current trampfd implementation, no physical pages are actually allocated. It is just a virtual mapping. From a memory footprint perspective, this is good. Maybe we can let the user specify if he wants a fast trampoline that consumes memory or a slow one that doesn't? - In the future, we may define additional types that need the kernel to do the job.
Examples: - The kernel may have a trampoline type for which it is not willing or able to generate code - The kernel could emulate dynamic code for the user - The kernel could interpret dynamic code for the user - The kernel could allow the user to access some kernel functionality using the framework In such cases, there isn't any physical code page that gets mapped into the user address space. We need the kernel to handle the address fault and provide the functionality. One question for the reviewers ---------------------------------------- Do you think that the file descriptor based approach is fine? Or, does this need a regular system call based implementation? There are some advantages with a regular system call: - We don't consume file descriptors. E.g., in libffi, we have to keep the file descriptor open for a closure until the closure is freed. - Trampoline operations can be performed based on the trampoline address instead of an fd. - Sharing of objects across processes can be implemented through a regular ID based method rather than sending the file descriptor over a unix domain socket. - Shared objects can be persistent. - An fd based API does structure parsing in read()/write() calls to obtain arguments. With a regular system call, that is not necessary. Please let me know your thoughts. Madhavan
On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman <madvenka@linux.microsoft.com> wrote: > > Sorry for the delay. I just wanted to think about this a little. > In this email, I will respond to your first suggestion. I will > respond to the rest in separate emails if that is alright with > you. > > On 7/28/20 12:31 PM, Andy Lutomirski wrote: > > On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote: > > From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> > > The kernel creates the trampoline mapping without any permissions. When > the trampoline is executed by user code, a page fault happens and the > kernel gets control. The kernel recognizes that this is a trampoline > invocation. It sets up the user registers based on the specified > register context, and/or pushes values on the user stack based on the > specified stack context, and sets the user PC to the requested target > PC. When the kernel returns, execution continues at the target PC. > So, the kernel does the work of the trampoline on behalf of the > application. > > This is quite clever, but now I’m wondering just how much kernel help > is really needed. In your series, the trampoline is a non-executable > page. I can think of at least two alternative approaches, and I'd > like to know the pros and cons. > > 1. Entirely userspace: a return trampoline would be something like: > > 1: > pushq %rax > pushq %rbx > pushq %rcx > ... > pushq %r15 > movq %rsp, %rdi # pointer to saved regs > leaq 1b(%rip), %rsi # pointer to the trampoline itself > callq trampoline_handler # see below > > You would fill a page with a bunch of these, possibly compacted to get > more per page, and then you would remap as many copies as needed. The > 'callq trampoline_handler' part would need to be a bit clever to make > it continue to work despite this remapping. This will be *much* > faster than trampfd. How much of your use case would it cover?
For > the inverse, it's not too hard to write a bit of asm to set all > registers and jump somewhere. > > Let me state what I have understood about this suggestion. Correct me if > I get anything wrong. If you don't mind, I will also take the liberty > of generalizing and paraphrasing your suggestion. > > The goal is to create two page mappings that are adjacent to each other: > > - a code page that contains template code for a trampoline. Since the > template code would tend to be small in size, pack as many of them > as possible within a page to conserve memory. In other words, create > an array of the template code fragments. Each element in the array > would be used for one trampoline instance. > > - a data page that contains an array of data elements. Corresponding > to each code element in the code page, there would be a data element > in the data page that would contain data that is specific to a > trampoline instance. > > - Code will access data using PC-relative addressing. > > The management of the code pages and allocation for each trampoline > instance would all be done in user space. > > Is this the general idea? Yes. > > Creating a code page > -------------------- > > We can do this in one of the following ways: > > - Allocate a writable page at run time, write the template code into > the page and have execute permissions on the page. > > - Allocate a writable page at run time, write the template code into > the page and remap the page with just execute permissions. > > - Allocate a writable page at run time, write the template code into > the page, write the page into a temporary file and map the file with > execute permissions. > > - Include the template code in a code page at build time itself and > just remap the code page each time you need a code page. This latter part shouldn't need any special permissions as far as I know. 
> > Pros and Cons > ------------- > > As long as the OS provides the functionality to do this and the security > subsystem in the OS allows the actions, this is totally feasible. If not, > we need something like trampfd. > > As Florian mentioned, libffi does implement something like this for MACH. > > In fact, in my libffi changes, I use trampfd only after all the other methods > have failed because of security settings. > > But the above approach only solves the problem for this simple type of > trampoline. It does not provide a framework for addressing more complex types > or even other forms of dynamic code. > > Also, each application would need to implement this solution for itself > as opposed to relying on one implementation provided by the kernel. I would argue this is a benefit. If the whole implementation is in userspace, there is no ABI compatibility issue. The user program contains the trampoline code and the code that uses it. > > Trampfd-based solution > ---------------------- > > I outlined an enhancement to trampfd in a response to David Laight. In this > enhancement, the kernel is the one that would set up the code page. > > The kernel would call an arch-specific support function to generate the > code required to load registers, push values on the stack and jump to a PC > for a trampoline instance based on its current context. The trampoline > instance data could be baked into the code. > > My initial idea was to only have one trampoline instance per page. But I > think I can implement multiple instances per page. I just have to manage > the trampfd file private data and VMA private data accordingly to map an > element in a code page to its trampoline object. > > The two approaches are similar except for the detail about who sets up > and manages the trampoline pages. In both approaches, the performance problem > is addressed. But trampfd can be used even when security settings are > restrictive. > > Is my solution acceptable? Perhaps.
In general, before adding a new ABI to the kernel, it's nice to understand how it's better than doing the same thing in userspace. Saying that it's easier for user code to work with if it's in the kernel isn't necessarily an adequate justification. Why would remapping two pages of actual application text ever fail?
On 7/30/20 3:54 PM, Andy Lutomirski wrote: > On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman > <madvenka@linux.microsoft.com> wrote: >> ... >> Creating a code page >> -------------------- >> >> We can do this in one of the following ways: >> >> - Allocate a writable page at run time, write the template code into >> the page and have execute permissions on the page. >> >> - Allocate a writable page at run time, write the template code into >> the page and remap the page with just execute permissions. >> >> - Allocate a writable page at run time, write the template code into >> the page, write the page into a temporary file and map the file with >> execute permissions. >> >> - Include the template code in a code page at build time itself and >> just remap the code page each time you need a code page. > This latter part shouldn't need any special permissions as far as I know. Agreed. > >> Pros and Cons >> ------------- >> >> As long as the OS provides the functionality to do this and the security >> subsystem in the OS allows the actions, this is totally feasible. If not, >> we need something like trampfd. >> >> As Florian mentioned, libffi does implement something like this for MACH. >> >> In fact, in my libffi changes, I use trampfd only after all the other methods >> have failed because of security settings. >> >> But the above approach only solves the problem for this simple type of >> trampoline. It does not provide a framework for addressing more complex types >> or even other forms of dynamic code. >> >> Also, each application would need to implement this solution for itself >> as opposed to relying on one implementation provided by the kernel. > I would argue this is a benefit. If the whole implementation is in > userspace, there is no ABI compatibility issue. The user program > contains the trampoline code and the code that uses it. The current trampfd implementation also does not have an ABI issue. ABI details are to be handled in user land.
In the case of libffi, they are. Trampfd only addresses the trampoline required to jump to the ABI handler. > >> Trampfd-based solution >> ---------------------- >> >> I outlined an enhancement to trampfd in a response to David Laight. In this >> enhancement, the kernel is the one that would set up the code page. >> >> The kernel would call an arch-specific support function to generate the >> code required to load registers, push values on the stack and jump to a PC >> for a trampoline instance based on its current context. The trampoline >> instance data could be baked into the code. >> >> My initial idea was to only have one trampoline instance per page. But I >> think I can implement multiple instances per page. I just have to manage >> the trampfd file private data and VMA private data accordingly to map an >> element in a code page to its trampoline object. >> >> The two approaches are similar except for the detail about who sets up >> and manages the trampoline pages. In both approaches, the performance problem >> is addressed. But trampfd can be used even when security settings are >> restrictive. >> >> Is my solution acceptable? > Perhaps. In general, before adding a new ABI to the kernel, it's nice > to understand how it's better than doing the same thing in userspace. > Saying that it's easier for user code to work with if it's in the > kernel isn't necessarily an adequate justification. Fair enough. Dealing with multiple architectures ----------------------------------------------- One good reason to use trampfd is multiple architecture support. The trampoline table in a code page approach is neat. I don't deny that at all. But my question is - can it be used in all cases? It requires PC-relative data references. I have not worked on all architectures. So, I need to study this. But do all ISAs support PC-relative data references? 
Even in an ISA that supports it, there would be a maximum supported offset from the current PC that can be reached for a data reference. That maximum needs to be at least the size of a base page in the architecture. This is because the code page and the data page need to be separate for security reasons. Do all ISAs support a sufficiently large offset? When the kernel generates the code for a trampoline, it can hard code data values in the generated code itself so it does not need PC-relative data referencing. And, for ISAs that do support the large offset, we do have to implement and maintain the code page stuff for different ISAs for each application and library if we did not use trampfd. If you look at the libffi reference patch that I have linked in the cover letter, I have added functions in common code that wrap trampfd calls. From architecture specific code, there is just one function call to one of those wrapper functions to set the register context for the trampoline. This is a very small C code change in each architecture. So, support can be extended to all architectures without exception easily. Runtime generated trampolines ------------------------------------------- libffi trampolines are simple. But there may be many cases out there where the trampoline code cannot be statically defined at build time. It may have to be generated at runtime. For this, we will need trampfd. Security ----------- With the user level trampoline table approach, the data part of the trampoline table can be hacked by an attacker if an application has a vulnerability. Specifically, the target PC can be altered to some arbitrary location. Trampfd implements an "Allowed PCS" context. In the libffi changes, I have created a read-only array of all ABI handlers used in closures for each architecture. This read-only array can be used to restrict the PC values for libffi trampolines to prevent hacking. 
To generalize, we can implement security rules/features if the trampoline object is in the kernel. Standardization --------------------- Trampfd is a framework that can be used to implement multiple things. Maybe a few of those things can also be implemented in user land itself. But I think having just one mechanism to execute dynamic code objects is preferable to having multiple mechanisms not standardized across all applications. As an example, let us say that I am able to implement support for JIT code. Let us say that an interpreter uses libffi to execute a generated function. The interpreter would use trampfd for the JIT code object and get an address. Then, it would pass that to libffi which would then use trampfd for the trampoline. So, trampfd based code objects can be chained. > Why would remapping two pages of actual application text ever fail? Remapping a page may not be available on all OSes. However, that is not a problem for the code page approach. One can always memory map the code page from the binary file directly. So, yes, this would not fail. Madhavan
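The "Allowed PCs" check from the Security section of the message above can be illustrated in plain C. This sketch models only the lookup against a read-only list of permitted targets; it does not use or reproduce the trampfd interface, and `abi_handler`, `handler_a`, `handler_b`, `unlisted_target`, `allowed_pcs` and `pc_is_allowed()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*abi_handler)(void);

/* Placeholder handlers; a real library would list its closure entry
 * points here. Distinct bodies keep their addresses distinct. */
int handler_a(void) { return 1; }
int handler_b(void) { return 2; }
int unlisted_target(void) { return 3; }

/* const data is normally mapped read-only (after relocation), so an
 * attacker with a plain memory write cannot extend the list. */
static const abi_handler allowed_pcs[] = { handler_a, handler_b };

/* Accept a target PC only if it appears in the fixed list. */
bool pc_is_allowed(abi_handler pc)
{
    for (size_t i = 0; i < sizeof(allowed_pcs) / sizeof(allowed_pcs[0]); i++)
        if (allowed_pcs[i] == pc)
            return true;
    return false;
}
```

A caller would run `pc_is_allowed()` on the target read from the trampoline's data slot before jumping to it; where that check runs (kernel or user space) is exactly the point under discussion in this thread.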
Hi, On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote: > From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> > Trampoline code is placed either in a data page or in a stack page. In > order to execute a trampoline, the page it resides in needs to be mapped > with execute permissions. Writable pages with execute permissions provide > an attack surface for hackers. Attackers can use this to inject malicious > code, modify existing code or do other harm. For the purpose of below, IIUC this assumes the adversary has an arbitrary write. > To mitigate this, LSMs such as SELinux may not allow pages to have both > write and execute permissions. This prevents trampolines from executing > and blocks applications that use trampolines. To allow genuine applications > to run, exceptions have to be made for them (by setting execmem, etc). > In this case, the attack surface is just the pages of such applications. > > An application that is not allowed to have writable executable pages > may try to load trampoline code into a file and map the file with execute > permissions. In this case, the attack surface is just the buffer that > contains trampoline code. However, a successful exploit may provide the > hacker with means to load his own code in a file, map it and execute it. It's not clear to me what power the adversary is assumed to have here, and consequently it's not clear to me how the proposal mitigates this. For example, if the attacker can control the arguments to syscalls, and has an arbitrary write as above, what prevents them from creating a trampfd of their own? [...] > GCC has traditionally used trampolines for implementing nested > functions. The trampoline is placed on the user stack. So, the stack > needs to be executable. IIUC generally nested functions are avoided these days, specifically to prevent the creation of gadgets on the stack. So I don't think those are relevant as a case to care about.
Applications using them should move to not using them, and would be more secure generally for doing so. [...] > Trampoline File Descriptor (trampfd) > -------------------------- > > I am proposing a kernel API using anonymous file descriptors that > can be used to create and execute trampolines with the help of the > kernel. In this solution also, the kernel does the work of the trampoline. What's the rationale for the kernel emulating the trampoline here? In the case of EMUTRAMP this was necessary to work with existing application binaries and kernel ABIs which placed instructions onto the stack, and the stack needed to remain RW for other reasons. That restriction doesn't apply here. Assuming trampfd creation is somehow authenticated, the code could be placed in a r-x page (which the kernel could refuse to add write permission), in order to prevent modification. If that's sufficient, it's not much of a leap to allow userspace to generate the code. > The kernel creates the trampoline mapping without any permissions. When > the trampoline is executed by user code, a page fault happens and the > kernel gets control. The kernel recognizes that this is a trampoline > invocation. It sets up the user registers based on the specified > register context, and/or pushes values on the user stack based on the > specified stack context, and sets the user PC to the requested target > PC. When the kernel returns, execution continues at the target PC. > So, the kernel does the work of the trampoline on behalf of the > application. > > In this case, the attack surface is the context buffer. A hacker may > attack an application with a vulnerability and may be able to modify the > context buffer. So, when the register or stack context is set for > a trampoline, the values may have been tampered with. From an attack > surface perspective, this is similar to Trampoline Emulation.
But > with trampfd, user code can retrieve a trampoline's context from the > kernel and add defensive checks to see if the context has been > tampered with. Can you elaborate on this: what sort of checks would be applied, and how? Why is this not possible in a r-x user page? [...] > - trampfd provides a basic framework. In the future, new trampoline types > can be implemented, new contexts can be defined, and additional rules > can be implemented for security purposes. From a kernel developer perspective, this reads as "this ABI will become more complex", which I think is worrisome. I'm also worried that this is liable to have nasty interaction with HW CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that we bake incompatibility into ABI. > - For instance, trampfd defines an "Allowed PCs" context in this initial > work. As an example, libffi can create a read-only array of all ABI > handlers for an architecture at build time. This array can be used to > set the list of allowed PCs for a trampoline. This will mean that a hacker > cannot hack the PC part of the register context and make it point to > arbitrary locations. I'm not exactly sure what's meant here. Do you mean that this prevents userspace from branching into the middle of a trampoline, or that the trampfd code prevents where the trampoline itself can branch to? Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the former, and I believe the latter can also be implemented in userspace with defensive checks in the trampolines, provided that they are protected read-only. > - An SELinux setting called "exectramp" can be implemented along the > lines of "execmem", "execstack" and "execheap" to selectively allow the > use of trampolines on a per application basis. > > - User code can add defensive checks in the code before invoking a > trampoline to make sure that a hacker has not modified the context data. 
> It can do this by getting the trampoline context from the kernel and > double checking it. As above, without examples it's not clear to me what sort of checks are possible nor where they would need to be made. So it's difficult to see whether that's actually possible or subject to TOCTTOU races and similar. > - In the future, if the kernel can be enhanced to use a safe code > generation component, that code can be placed in the trampoline mapping > pages. Then, the trampoline invocation does not have to incur a trip > into the kernel. > > - Also, if the kernel can be enhanced to use a safe code generation > component, other forms of dynamic code such as JIT code can be > addressed by the trampfd framework. I don't see why it's necessary for the kernel to generate code at all. If the trampfd creation requests can be trusted, what prevents trusting a sealed set of instructions generated in userspace? > - Trampolines can be shared across processes which can give rise to > interesting uses in the future. This sounds like the use-case of a sealed memfd. Is a sealed executable memfd not sufficient? Thanks, Mark.
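For readers unfamiliar with the sealed memfd Mark refers to, the sealing step can be sketched with the existing Linux calls `memfd_create(2)` and `fcntl(2)` `F_ADD_SEALS`. This assumes Linux 3.17 or later and glibc 2.27 or later; `make_sealed_memfd()` and the name "tramp-code" are hypothetical. The sketch shows only that the contents become immutable; whether such a descriptor may then be mapped with execute permission depends on the security policy in force and is not shown.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for memfd_create() and F_ADD_SEALS */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes from buf into an anonymous memory file and seal it
 * against writes and resizing. Returns the fd, or -1 on failure.
 */
int make_sealed_memfd(const void *buf, size_t len)
{
    int fd = memfd_create("tramp-code", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len)
        goto fail;

    /* F_SEAL_SEAL prevents any later change to the seal set itself. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}
```

After sealing, a `write()` or `ftruncate()` on the descriptor fails, including in any other process the descriptor is passed to, which is what makes it a candidate for sharing fixed trampoline code.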
On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> > On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> > <madvenka@linux.microsoft.com> wrote:

> Dealing with multiple architectures
> -----------------------------------------------
>
> One good reason to use trampfd is multiple architecture support. The
> trampoline table in a code page approach is neat. I don't deny that at
> all. But my question is - can it be used in all cases?
>
> It requires PC-relative data references. I have not worked on all architectures.
> So, I need to study this. But do all ISAs support PC-relative data references?

Not all do, but pretty much any recent ISA will as it's a practical
necessity for fast position-independent code.

> Even in an ISA that supports it, there would be a maximum supported offset
> from the current PC that can be reached for a data reference. That maximum
> needs to be at least the size of a base page in the architecture. This is because
> the code page and the data page need to be separate for security reasons.
> Do all ISAs support a sufficiently large offset?

ISAs with PC-relative addressing can usually generate PC-relative
addresses into a GPR, from which they can apply an arbitrarily large
offset.

> When the kernel generates the code for a trampoline, it can hard code data values
> in the generated code itself so it does not need PC-relative data referencing.
>
> And, for ISAs that do support the large offset, we do have to implement and
> maintain the code page stuff for different ISAs for each application and library
> if we did not use trampfd.

Trampoline code is architecture specific today, so I don't see that as a
major issue. Common structural bits can probably be shared even if the
specific machine code cannot.

[...]
> Security
> -----------
>
> With the user level trampoline table approach, the data part of the trampoline table
> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> target PC can be altered to some arbitrary location. Trampfd implements an
> "Allowed PCs" context. In the libffi changes, I have created a read-only array of
> all ABI handlers used in closures for each architecture. This read-only array
> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>
> To generalize, we can implement security rules/features if the trampoline
> object is in the kernel.

I don't follow this argument. If it's possible to statically define that
in the kernel, it's also possible to do that in userspace without any
new kernel support.

[...]

> Trampfd is a framework that can be used to implement multiple things. Maybe,
> a few of those things can also be implemented in user land itself. But I think having
> just one mechanism to execute dynamic code objects is preferable to having
> multiple mechanisms not standardized across all applications.

In abstract, having a common interface sounds nice, but in practice
elements of this are always architecture-specific (e.g. interactions
with HW CFI), and that common interface can result in more pain as it
doesn't fit naturally into the context that ISAs were designed for (e.g.
where control-flow instructions are extended with new semantics).

It also means that you can't share the rough approach across OSs which
do not implement an identical mechanism, so for code abstracting by ISA
first, then by platform/ABI, there isn't much saving.

Thanks,
Mark.
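The userspace variant of the "Allowed PCs" check that Mark alludes to could look roughly like the sketch below. Everything named here (`handler_int`, `handler_float`, `handler_struct`, `not_on_list`, `pc_is_allowed`, `checked_dispatch`) is a hypothetical stand-in; the real libffi ABI handlers and the trampfd context layout are not reproduced. The only assumption is that the table is a `const` array, so it ends up in a read-only mapping once the program is loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*abi_handler_t)(void);

/* Records which handler ran last, so the effect of a dispatch is visible. */
int last_handler;

/* Hypothetical ABI handlers standing in for libffi's closure entry points. */
void handler_int(void)    { last_handler = 1; }
void handler_float(void)  { last_handler = 2; }
void handler_struct(void) { last_handler = 3; }

/* A function that is deliberately NOT on the allowed list. */
void not_on_list(void)    { last_handler = -1; }

/* Read-only table of permitted targets, fixed at build time. */
static const abi_handler_t allowed_pcs[] = {
    handler_int, handler_float, handler_struct,
};

/* Return true only if `target` is one of the statically allowed PCs. */
bool pc_is_allowed(abi_handler_t target)
{
    for (size_t i = 0; i < sizeof allowed_pcs / sizeof allowed_pcs[0]; i++)
        if (allowed_pcs[i] == target)
            return true;
    return false;
}

/* Validate a (possibly attacker-influenced) target before branching to it. */
bool checked_dispatch(abi_handler_t target)
{
    if (!pc_is_allowed(target))
        return false;
    assert(target != NULL);   /* the table contains no NULL entries */
    target();
    return true;
}
```

A tampered target pointer is rejected before the indirect call, which is the property the kernel-side "Allowed PCs" context is meant to provide.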
Thanks for the comments. I will respond to these and your next email on Monday.

Madhavan
Hi!

> > This is quite clever, but now I'm wondering just how much kernel help
> > is really needed. In your series, the trampoline is a non-executable
> > page. I can think of at least two alternative approaches, and I'd
> > like to know the pros and cons.
> >
> > 1. Entirely userspace: a return trampoline would be something like:
> >
> > 1:
> >     pushq %rax
> >     pushq %rbx
> >     pushq %rcx
> >     ...
> >     pushq %r15
> >     movq %rsp, %rdi # pointer to saved regs
> >     leaq 1b(%rip), %rsi # pointer to the trampoline itself
> >     callq trampoline_handler # see below
>
> For nested calls (where the trampoline needs to pass the
> original stack frame to the nested function) I think you
> just need a page full of:
>     mov $0, scratch_reg; jmp trampoline_handler

I believe you could do with mov %pc, scratch_reg; jmp ...

That has the advantage of being able to share a single physical
page across multiple virtual pages...

Pavel
* Madhavan T. Venkataraman:

> Standardization
> ---------------------
>
> Trampfd is a framework that can be used to implement multiple
> things. Maybe, a few of those things can also be implemented in
> user land itself. But I think having just one mechanism to execute
> dynamic code objects is preferable to having multiple mechanisms not
> standardized across all applications.
>
> As an example, let us say that I am able to implement support for
> JIT code. Let us say that an interpreter uses libffi to execute a
> generated function. The interpreter would use trampfd for the JIT
> code object and get an address. Then, it would pass that to libffi
> which would then use trampfd for the trampoline. So, trampfd based
> code objects can be chained.

There is certainly value in coordination. For example, it would be
nice if unwinders could recognize the trampolines during all phases
and unwind correctly through them (including when interrupted by an
asynchronous signal). That requires some level of coordination with
the unwinder and dynamic linker.

A kernel solution could hide the intermediate state in a kernel-side
trap handler, but I think it wouldn't reduce the overall complexity.
More responses inline.. On 7/28/20 12:31 PM, Andy Lutomirski wrote: >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote: >> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> >> > > 2. Use existing kernel functionality. Raise a signal, modify the > state, and return from the signal. This is very flexible and may not > be all that much slower than trampfd. Let me understand this. You are saying that the trampoline code would raise a signal and, in the signal handler, set up the context so that when the signal handler returns, we end up in the target function with the context correctly set up. And, this trampoline code can be generated statically at build time so that there are no security issues using it. Have I understood your suggestion correctly? So, my argument would be that this would always incur the overhead of a trip to the kernel. I think twice the overhead if I am not mistaken. With trampfd, we can have the kernel generate the code so that there is no performance penalty at all. Signals have many problems. Which signal number should we use for this purpose? If we use an existing one, that might conflict with what the application is already handling. Getting a new signal number for this could meet with resistance from the community. Also, signals are asynchronous. So, they are vulnerable to race conditions. To prevent other signals from coming in while handling the raised signal, we would need to block and unblock signals. This will cause more overhead. > 3. Use a syscall. Instead of having the kernel handle page faults, > have the trampoline code push the syscall nr register, load a special > new syscall nr into the syscall nr register, and do a syscall. On > x86_64, this would be: > > pushq %rax > movq __NR_magic_trampoline, %rax > syscall > > with some adjustment if the stack slot you're clobbering is important. How is this better than the kernel handling an address fault? 
The system call still needs to do the same work as the fault handler.
We do need to specify the register and stack contexts beforehand so the
system call can do its job.

Also, this always incurs a trip to the kernel. With trampfd, the kernel
could generate the code to avoid the performance penalty.

> Also, will using trampfd cause issues with various unwinders? I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.

I need to study unwinders a little before I respond to this question.
So, bear with me.

> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code. This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc. A cleverly designed JIT interface could function without
> serialization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games. At its very simplest, this could be:
>
>     void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested. There could also be:
>
>     int jittfd_create(...);
>
> that does something similar but creates a memfd. A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region. On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page without any
> synchronization. One approach would be to start with:
>
>     <some code>
>     0xcc
>     0xcc
>     ...
>     0xcc
>
> and to create a whole new page like:
>
>     <some code>
>     <some more code>
>     0xcc
>     ...
>     0xcc
>
> so that the only difference is that some code changed to some more
> code.
> Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs. This may not work if <some more code> spans a page
> boundary. The #BP fixup would zap the TLB and retry. Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that cross I$ fetch
> boundaries. I'm not sure to what extent I$ snooping helps.

I am thinking that the trampfd API can be used for addressing JIT
code as well. I have not yet started thinking about the details. But I
think the API is sufficient. E.g.,

    struct trampfd_jit {
        void    *source;
        size_t  len;
    };

    struct trampfd_jit  jit;
    struct trampfd_map  map;
    void                *addr;
    int                 fd;

    jit.source = blah;
    jit.len = blah;

    fd = syscall(440, TRAMPFD_JIT, &jit, flags);
    pread(fd, &map, sizeof(map), TRAMPFD_MAP_OFFSET);
    addr = mmap(NULL, map.size, map.prot, map.flags, fd, map.offset);

And addr would be used to invoke the generated JIT code.

Madhavan
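For comparison, the `jit_create_code(source, len)` idea quoted above can be approximated in userspace today, as in the sketch below. This is only an illustration: `jit_create_code_user` is a hypothetical name, no such kernel interface exists, and the write-then-`mprotect` step used here is exactly what an execmem-style policy would refuse, which is the motivation for a kernel-provided interface in the first place.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Userspace approximation of the proposed jit_create_code(): returns a new
 * anonymous read+execute mapping containing a copy of `source`, padded with
 * 0xCC (int3 on x86). Returns NULL on failure.
 */
void *jit_create_code_user(const void *source, size_t len)
{
    assert(source != NULL && len > 0);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t map_len = (len + page - 1) / page * page;   /* round up to pages */

    unsigned char *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memset(p, 0xCC, map_len);      /* unused tail traps if executed on x86 */
    memcpy(p, source, len);

    /* Drop write permission before the region ever becomes executable. */
    if (mprotect(p, map_len, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, map_len);
        return NULL;
    }

    /* Needed on architectures without coherent instruction caches. */
    __builtin___clear_cache((char *)p, (char *)p + map_len);
    return p;
}
```

The mapping is never writable and executable at the same time, but the process that creates it still has to be trusted with the transition.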
On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman <madvenka@linux.microsoft.com> wrote: > > More responses inline.. > > On 7/28/20 12:31 PM, Andy Lutomirski wrote: > >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote: > >> > >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> > >> > > > > 2. Use existing kernel functionality. Raise a signal, modify the > > state, and return from the signal. This is very flexible and may not > > be all that much slower than trampfd. > > Let me understand this. You are saying that the trampoline code > would raise a signal and, in the signal handler, set up the context > so that when the signal handler returns, we end up in the target > function with the context correctly set up. And, this trampoline code > can be generated statically at build time so that there are no > security issues using it. > > Have I understood your suggestion correctly? yes. > > So, my argument would be that this would always incur the overhead > of a trip to the kernel. I think twice the overhead if I am not mistaken. > With trampfd, we can have the kernel generate the code so that there > is no performance penalty at all. I feel like trampfd is too poorly defined at this point to evaluate. There are three general things it could do. It could generate actual code that varies by instance. It could have static code that does not vary. And it could actually involve a kernel entry. If it involves a kernel entry, then it's slow. Maybe this is okay for some use cases. If it involves only static code, I see no good reason that it should be in the kernel. If it involves dynamic code, then I think it needs a clearly defined use case that actually requires dynamic code. > Also, signals are asynchronous. So, they are vulnerable to race conditions. > To prevent other signals from coming in while handling the raised signal, > we would need to block and unblock signals. This will cause more > overhead. 
If you're worried about raise() racing against signals from out of thread, you have bigger problems to deal with.
On 8/2/20 3:00 PM, Andy Lutomirski wrote: > On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman > <madvenka@linux.microsoft.com> wrote: >> More responses inline.. >> >> On 7/28/20 12:31 PM, Andy Lutomirski wrote: >>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote: >>>> >>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> >>>> >>> 2. Use existing kernel functionality. Raise a signal, modify the >>> state, and return from the signal. This is very flexible and may not >>> be all that much slower than trampfd. >> Let me understand this. You are saying that the trampoline code >> would raise a signal and, in the signal handler, set up the context >> so that when the signal handler returns, we end up in the target >> function with the context correctly set up. And, this trampoline code >> can be generated statically at build time so that there are no >> security issues using it. >> >> Have I understood your suggestion correctly? > yes. > >> So, my argument would be that this would always incur the overhead >> of a trip to the kernel. I think twice the overhead if I am not mistaken. >> With trampfd, we can have the kernel generate the code so that there >> is no performance penalty at all. > I feel like trampfd is too poorly defined at this point to evaluate. > There are three general things it could do. It could generate actual > code that varies by instance. It could have static code that does not > vary. And it could actually involve a kernel entry. > > If it involves a kernel entry, then it's slow. Maybe this is okay for > some use cases. Yes. IMO, it is OK for most cases except where dynamic code is used specifically for enhancing performance such as interpreters using JIT code for frequently executed sequences and dynamic binary translation. > If it involves only static code, I see no good reason that it should > be in the kernel. It does not involve only static code. This is meant for dynamic code. However, see below. 
> If it involves dynamic code, then I think it needs a clearly defined > use case that actually requires dynamic code. Fair enough. I will work on this and get back to you. This might take a little time. So, bear with me. But I would like to make one point here. There are many applications and libraries out there that use trampolines. They may all require the same sort of things: - set register context - push stuff on stack - jump to a target PC But in each case, the context would be different: - only register context - only stack context - both register and stack contexts - different registers - different values pushed on the stack - different target PCs If we had to do this purely at user level, each application/library would need to roll its own solution, the solution has to be implemented for each supported architecture and maintained. While the code is static in each separate case, it is dynamic across all of them. That is, the kernel will generate the code on the fly for each trampoline instance based on its current context. It will not maintain any static trampoline code at all. Basically, it will supply the context to an arch-specific function and say: - generate instructions for loading these regs with these values - generate instructions to push these values on the stack - generate an instruction to jump to this target PC It will place all of those generated instructions on a page and return the address. So, even with the static case, there is a lot of value in the kernel providing this. Plus, it has the framework to handle dynamic code. >> Also, signals are asynchronous. So, they are vulnerable to race conditions. >> To prevent other signals from coming in while handling the raised signal, >> we would need to block and unblock signals. This will cause more >> overhead. > If you're worried about raise() racing against signals from out of > thread, you have bigger problems to deal with. Agreed. 
The signal blocking is just one example of problems related to signals. There are other bigger problems as well. So, let us remove the signal-based approach from our discussions. Thanks. Madhavan
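The generation step described in this reply (load registers with given values, then jump to a target PC) can be sketched for x86-64 as follows. This is an illustration under stated assumptions, not code from the trampfd patches: the name `gen_trampoline_x86_64`, the use of %r10 for the context value and %r11 as the jump register are all choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMP_CODE_LEN 23   /* 10 + 10 + 3 bytes, see below */

/* Emit "movabs $imm, %rN" for r8..r15; `opcode` is 0xB8 + (N - 8). */
static size_t emit_movabs(uint8_t *out, uint8_t opcode, uint64_t imm)
{
    out[0] = 0x49;                     /* REX.W | REX.B */
    out[1] = opcode;
    for (int i = 0; i < 8; i++)        /* 64-bit immediate, little-endian */
        out[2 + i] = (uint8_t)(imm >> (8 * i));
    return 10;
}

/*
 * Write a trampoline into `out`:
 *     movabs $ctx_value, %r10
 *     movabs $target_pc, %r11
 *     jmp    *%r11
 * Returns the number of bytes written, or 0 if `cap` is too small.
 */
size_t gen_trampoline_x86_64(uint8_t *out, size_t cap,
                             uint64_t ctx_value, uint64_t target_pc)
{
    assert(out != NULL);
    if (cap < TRAMP_CODE_LEN)
        return 0;

    size_t n = 0;
    n += emit_movabs(out + n, 0xBA, ctx_value);   /* %r10 */
    n += emit_movabs(out + n, 0xBB, target_pc);   /* %r11 */
    out[n++] = 0x41;                              /* REX.B */
    out[n++] = 0xFF;                              /* jmp r/m64 (FF /4) */
    out[n++] = 0xE3;                              /* ModRM: 11 100 011 */
    return n;
}
```

Because both values are immediates inside the instruction stream, no PC-relative data reference is needed, which is the property argued for earlier in the thread.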
From: Pavel Machek <pavel@ucw.cz>
> Sent: 02 August 2020 12:56
> Hi!
>
> > > This is quite clever, but now I'm wondering just how much kernel help
> > > is really needed. In your series, the trampoline is a non-executable
> > > page. I can think of at least two alternative approaches, and I'd
> > > like to know the pros and cons.
> > >
> > > 1. Entirely userspace: a return trampoline would be something like:
> > >
> > > 1:
> > >     pushq %rax
> > >     pushq %rbx
> > >     pushq %rcx
> > >     ...
> > >     pushq %r15
> > >     movq %rsp, %rdi # pointer to saved regs
> > >     leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > >     callq trampoline_handler # see below
> >
> > For nested calls (where the trampoline needs to pass the
> > original stack frame to the nested function) I think you
> > just need a page full of:
> >     mov $0, scratch_reg; jmp trampoline_handler
>
> I believe you could do with mov %pc, scratch_reg; jmp ...
>
> That has advantage of being able to share single physical
> page across multiple virtual pages...

A lot of architectures don't let you copy %pc that way so you would
have to use 'call' - but that trashes the return address cache.
It also needs the trampoline handler to know the addresses
of the trampolines.

	David
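David's last point, that the handler has to know where the trampolines live, can be illustrated with a small sketch. It assumes a layout that is not spelled out in the thread: fixed-size trampoline slots packed into one code page, with the per-trampoline data held in a separate table indexed by slot number. All names (`tramp_table`, `tramp_data`, `tramp_lookup`, `TRAMP_STRIDE`, `TRAMP_COUNT`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMP_STRIDE 16u    /* assumed size of one trampoline slot in bytes */
#define TRAMP_COUNT  256u   /* assumed number of slots in the code page */

/* Per-trampoline data, kept in a page separate from the code. */
struct tramp_data {
    void *target;           /* function the trampoline should reach */
    void *closure;          /* context handed to that function */
};

struct tramp_table {
    uintptr_t code_base;                    /* address of slot 0 */
    struct tramp_data data[TRAMP_COUNT];    /* data[i] belongs to slot i */
};

/*
 * Given the address a trampoline passed for "itself", recover its data.
 * Returns NULL if the address is not the start of a valid slot.
 */
const struct tramp_data *tramp_lookup(const struct tramp_table *t,
                                      uintptr_t slot_addr)
{
    assert(t != NULL);
    if (slot_addr < t->code_base)
        return NULL;

    uintptr_t off = slot_addr - t->code_base;
    if (off % TRAMP_STRIDE != 0 || off / TRAMP_STRIDE >= TRAMP_COUNT)
        return NULL;

    return &t->data[off / TRAMP_STRIDE];
}
```

Rejecting addresses that are misaligned or out of range keeps a corrupted "pointer to the trampoline itself" from indexing outside the table.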
From: Madhavan T. Venkataraman
> Sent: 02 August 2020 19:55
> To: Andy Lutomirski <luto@kernel.org>
> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>; Linux API <linux-api@vger.kernel.org>;
> linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Linux FS Devel <linux-fsdevel@vger.kernel.org>;
> linux-integrity <linux-integrity@vger.kernel.org>; LKML <linux-kernel@vger.kernel.org>;
> LSM List <linux-security-module@vger.kernel.org>; Oleg Nesterov <oleg@redhat.com>; X86 ML <x86@kernel.org>
> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
>
> More responses inline..
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality. Raise a signal, modify the
> > state, and return from the signal. This is very flexible and may not
> > be all that much slower than trampfd.
>
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
>
> Have I understood your suggestion correctly?

I was thinking that you'd just let the 'not executable' page fault
signal happen (SIGSEGV?) when the jump to the on-stack trampoline
is executed.

The user signal handler can then decode the faulting instruction
and, if it matches the expected on-stack trampoline, modify the
saved registers before returning from the signal.

No kernel changes and all you need to add to the program is
an architecture-dependent signal handler.

	David
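A much simplified sketch of the signal-handler approach David describes is shown below, assuming Linux on x86-64 or arm64. It does not decode any instructions: a `PROT_NONE` page stands in for the non-executable trampoline, and the handler only compares the fault address before rewriting the saved PC. The names `fault_trampoline_init`, `call_via_fault` and `add_one` are hypothetical, and a real handler would also need to chain to any existing SIGSEGV handling.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

static void *tramp_page;        /* inaccessible page standing in for the trampoline */
static uintptr_t tramp_target;  /* where a call into tramp_page should end up */

/* SIGSEGV handler: if the fault is a jump to tramp_page, redirect the PC. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    (void)sig;

    if (info->si_addr != tramp_page) {
        /* Not our trampoline: restore the default action so the retried
         * access terminates the process as usual. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
#if defined(__x86_64__)
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)tramp_target;
#elif defined(__aarch64__)
    uc->uc_mcontext.pc = tramp_target;
#else
#error "this sketch only covers x86-64 and arm64 Linux"
#endif
}

/* Install the handler and create the stand-in trampoline page. */
int fault_trampoline_init(int (*target)(int))
{
    assert(target != NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    tramp_page = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tramp_page == MAP_FAILED)
        return -1;

    tramp_target = (uintptr_t)target;
    return 0;
}

/* Call "the trampoline"; the fault handler forwards the call to the target. */
int call_via_fault(int arg)
{
    int (*fn)(int) = (int (*)(int))tramp_page;
    return fn(arg);
}

/* Example target used to show that arguments and return value survive. */
int add_one(int x)
{
    return x + 1;
}
```

Every call through the page costs a fault, a signal delivery and a sigreturn, which is the overhead being weighed against trampfd in this thread.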
From: Mark Rutland
> Sent: 31 July 2020 19:32
...
> > It requires PC-relative data references. I have not worked on all architectures.
> > So, I need to study this. But do all ISAs support PC-relative data references?
>
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

i386 has neither PC-relative addressing nor moves from %pc.
The cpu architecture knows that the sequence:
	call	1f
1:	pop	%reg
is used to get the %pc value so is treated specially so that
it doesn't 'trash' the return stack.

So PIC code isn't too bad, but you have to use the correct
sequence.

	David
On 8/3/20 3:08 AM, David Laight wrote:
> From: Pavel Machek <pavel@ucw.cz>
>> Sent: 02 August 2020 12:56
>> Hi!
>>
>>>> This is quite clever, but now I'm wondering just how much kernel help
>>>> is really needed. In your series, the trampoline is a non-executable
>>>> page. I can think of at least two alternative approaches, and I'd
>>>> like to know the pros and cons.
>>>>
>>>> 1. Entirely userspace: a return trampoline would be something like:
>>>>
>>>> 1:
>>>>     pushq %rax
>>>>     pushq %rbx
>>>>     pushq %rcx
>>>>     ...
>>>>     pushq %r15
>>>>     movq %rsp, %rdi # pointer to saved regs
>>>>     leaq 1b(%rip), %rsi # pointer to the trampoline itself
>>>>     callq trampoline_handler # see below
>>> For nested calls (where the trampoline needs to pass the
>>> original stack frame to the nested function) I think you
>>> just need a page full of:
>>>     mov $0, scratch_reg; jmp trampoline_handler
>> I believe you could do with mov %pc, scratch_reg; jmp ...
>>
>> That has advantage of being able to share single physical
>> page across multiple virtual pages...
> A lot of architecture don't let you copy %pc that way so you would
> have to use 'call' - but that trashes the return address cache.
> It also needs the trampoline handler to know the addresses
> of the trampolines.

Do you know which ones don't allow you to copy %pc?

Some of the architectures do not have PC-relative data references. If they
do not allow you to copy the PC into a general purpose register, then there
is no way to implement the statically defined trampoline that has been
discussed so far. In these cases, the trampoline has to be generated at
runtime.

Thanks.

Madhavan
On 8/3/20 3:23 AM, David Laight wrote: > From: Madhavan T. Venkataraman >> Sent: 02 August 2020 19:55 >> To: Andy Lutomirski <luto@kernel.org> >> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>; Linux API <linux-api@vger.kernel.org>; >> linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Linux FS Devel <linux- >> fsdevel@vger.kernel.org>; linux-integrity <linux-integrity@vger.kernel.org>; LKML <linux- >> kernel@vger.kernel.org>; LSM List <linux-security-module@vger.kernel.org>; Oleg Nesterov >> <oleg@redhat.com>; X86 ML <x86@kernel.org> >> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor >> >> More responses inline.. >> >> On 7/28/20 12:31 PM, Andy Lutomirski wrote: >>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote: >>>> >>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> >>>> >>> 2. Use existing kernel functionality. Raise a signal, modify the >>> state, and return from the signal. This is very flexible and may not >>> be all that much slower than trampfd. >> Let me understand this. You are saying that the trampoline code >> would raise a signal and, in the signal handler, set up the context >> so that when the signal handler returns, we end up in the target >> function with the context correctly set up. And, this trampoline code >> can be generated statically at build time so that there are no >> security issues using it. >> >> Have I understood your suggestion correctly? > I was thinking that you'd just let the 'not executable' page fault > signal happen (SIGSEGV?) when the code jumps to on-stack trampoline > is executed. > > The user signal handler can then decode the faulting instruction > and, if it matches the expected on-stack trampoline, modify the > saved registers before returning from the signal. > > No kernel changes and all you need to add to the program is > an architecture-dependant signal handler. Understood. Madhavan
On 8/3/20 3:27 AM, David Laight wrote:
> From: Mark Rutland
>> Sent: 31 July 2020 19:32
> ...
>>> It requires PC-relative data references. I have not worked on all architectures.
>>> So, I need to study this. But do all ISAs support PC-relative data references?
>> Not all do, but pretty much any recent ISA will as it's a practical
>> necessity for fast position-independent code.
> i386 has neither PC-relative addressing nor moves from %pc.
> The cpu architecture knows that the sequence:
> 	call	1f
> 1:	pop	%reg
> is used to get the %pc value so is treated specially so that
> it doesn't 'trash' the return stack.
>
> So PIC code isn't too bad, but you have to use the correct
> sequence.

Is that true for 32-bit systems only? I thought RIP-relative addressing was
introduced in 64-bit mode. Please confirm.

Madhavan
From: Madhavan T. Venkataraman
> Sent: 03 August 2020 17:03
>
> On 8/3/20 3:27 AM, David Laight wrote:
> > From: Mark Rutland
> >> Sent: 31 July 2020 19:32
> > ...
> >>> It requires PC-relative data references. I have not worked on all architectures.
> >>> So, I need to study this. But do all ISAs support PC-relative data references?
> >> Not all do, but pretty much any recent ISA will as it's a practical
> >> necessity for fast position-independent code.
> > i386 has neither PC-relative addressing nor moves from %pc.
> > The cpu architecture knows that the sequence:
> >	call	1f
> > 1:	pop	%reg
> > is used to get the %pc value so is treated specially so that
> > it doesn't 'trash' the return stack.
> >
> > So PIC code isn't too bad, but you have to use the correct
> > sequence.
>
> Is that true for 32-bit systems only? I thought RIP-relative addressing was
> introduced in 64-bit mode. Please confirm.

I said i386 not amd64 or x86-64.

So yes, 64bit code has PC-relative addressing.
But I'm pretty sure it has no other way to get the PC itself
except using call - certainly nothing in the 'usual' instructions.

	David
Responses inline..

On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attacker can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?

That is the point. If a process is allowed to have pages that are both
writable and executable, a hacker can exploit some vulnerability such
as buffer overflow to write his own code into a page and somehow
contrive to execute that.

So, the context is - if security settings in a system disallow a page to have
both write and execute permissions, how do you allow the execution of
genuine trampolines that are runtime generated and placed in a data
page or a stack page?

trampfd tries to address that. So, trampfd is not a measure that increases
the security of a system or mitigates a security problem. It is a framework
to allow safe forms of dynamic code to execute when security settings
will block them otherwise.

>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a case to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.

Could not agree with you more.

>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In the case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.

In addition to the stack, EMUTRAMP also allows the emulation
of the same well-known trampolines placed in a non-stack data page.
For instance, libffi closures embed a trampoline in a closure structure.
That gets executed when the caller of libffi invokes it.

The goal of EMUTRAMP is to allow safe trampolines to execute when
security settings disallow their execution. Mainly, it permits applications
that use libffi to run. A lot of applications use libffi.

They chose the emulation method so that no changes need to be made
to application code to use them. But the EMUTRAMP implementors note
in their description that the real solution to the problem is a kernel
API that is backed by a safe code generator.

trampfd is an attempt to define such an API. This is just a starting point.
I realize that we need to have a lot of discussion to refine the approach.

> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.

IIUC, you are suggesting that the user hands the kernel a code fragment
and requests it to be placed in an r-x page, correct? However, the
kernel cannot trust any code given to it by the user. Nor can it scan any
piece of code and reliably decide if it is safe or not.

So, the problem of executing dynamic code when security settings are
restrictive cannot be solved in userland. The only option I can think of is
to have the kernel provide support for dynamic code. It must have one
or more safe, trusted code generation components and an API to use
the components.

My goal is to introduce an API and start off by supporting simple, regular
trampolines that are widely used. Then, evolve the feature over a period
of time to include other forms of dynamic code such as JIT code.

>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?

So, an application that uses trampfd would do the following steps:

1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4. Invoke the trampoline using the address

Let us say that the application has a vulnerability such as buffer overflow
that allows a hacker to modify the data that is used to do step 2.

Potentially, a hacker could modify the following things:
    - register values specified in the register context
    - values specified in the stack context
    - the target PC specified in the register context

When the trampoline is invoked in step 4, the kernel will gain control,
load the registers, push stuff on the stack and transfer control to the target
PC. Whatever the hacker had modified in step 2 will take effect in step 4.
His values will get loaded and his PC is the one that will get control.

A paranoid application could add a step to this sequence. So, the steps
would be:

1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4a. Retrieve the register and stack context for the trampoline from the
    kernel and check if anything has been altered. If yes, abort.
4b. Invoke the trampoline using the address

The check that I mentioned will be in step 4a. Now, the hacker has to hack
both step 2 and step 4a to let his stuff take effect. That is far less
likely to succeed because there needs to exist a vulnerability in both
places.

> Why is this not possible in a r-x user page?

This is answered above.

>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>> can be implemented, new contexts can be defined, and additional rules
>> can be implemented for security purposes.
> From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.

I hear you. My goal from the beginning is to not have the kernel deal with
ABI issues. ABI handling is best left to userland (except in cases like
signal handlers where the kernel does have to deal with it).

In the libffi changes, this is certainly true. The kernel only helps with
the trampoline that passes control to the ABI handler. The ABI handler
itself is part of libffi.

> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.

I will study CFI and then answer this question. So, bear with me.

>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>> work. As an example, libffi can create a read-only array of all ABI
>> handlers for an architecture at build time. This array can be used to
>> set the list of allowed PCs for a trampoline. This will mean that a hacker
>> cannot hack the PC part of the register context and make it point to
>> arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code prevents where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.

So, I mentioned before that a hacker can potentially alter the target PC
that a trampoline finally jumps to. If a process were allowed to have pages
with both write and execute permissions, a hacker could load his own code
in one of those pages and point the PC to that.

In the context of trampfd, we are talking about the case where a process is
not permitted to have both write and execute permissions. In this case, the
hacker cannot load his own code anywhere and hope to execute it. But a
hacker can point the PC to some arbitrary place such as return from glibc.

>
>> - An SELinux setting called "exectramp" can be implemented along the
>> lines of "execmem", "execstack" and "execheap" to selectively allow the
>> use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>> trampoline to make sure that a hacker has not modified the context data.
>> It can do this by getting the trampoline context from the kernel and
>> double checking it.
> As above, without examples it's not clear to me what sort of checks are
> possible nor where they would need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.

I have explained this above. If there are any further questions on that,
please let me know.

>
>> - In the future, if the kernel can be enhanced to use a safe code
>> generation component, that code can be placed in the trampoline mapping
>> pages. Then, the trampoline invocation does not have to incur a trip
>> into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>> component, other forms of dynamic code such as JIT code can be
>> addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?

Let us consider a system in which:
    - a process is not permitted to have pages with both write and execute
    - a process is not permitted to map any file as executable unless it
      is properly signed. In other words, cryptographically verified.

Then, the process cannot execute any code that is runtime generated.
That includes trampolines. Only trampoline code that is part of program
text at build time would be permitted to execute.

In this scenario, trampfd requests are coming from signed code. So, they
are trusted by the kernel. But trampoline code could be dynamically
generated. The kernel will not trust it.

>> - Trampolines can be shared across processes which can give rise to
>> interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?

I will answer this in a separate email.

Thanks.

Madhavan
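The defensive check in step 4a above could look roughly like the sketch below. It is only an illustration under stated assumptions: struct tramp_ctx is a made-up layout (not the structures defined by the trampfd series), reading the context back with pread() is assumed by analogy with the pwrite() call shown elsewhere in the thread, and the demo uses an ordinary temporary file as a stand-in for a trampfd descriptor so that it can run on an unpatched kernel.

```c
#define _DEFAULT_SOURCE        /* pread, pwrite, mkstemp */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical context layout, for illustration only. */
struct tramp_ctx {
    uint64_t target_pc;
    uint64_t regs[2];
};

/* Step 2: store the context through the descriptor. */
static int ctx_set(int fd, off_t off, const struct tramp_ctx *ctx)
{
    return pwrite(fd, ctx, sizeof(*ctx), off) == (ssize_t)sizeof(*ctx) ? 0 : -1;
}

/*
 * Step 4a: read the context back and compare it with a trusted copy.
 * Returns 1 if it matches, 0 if it differs, -1 on I/O error.
 */
static int ctx_verify(int fd, off_t off, const struct tramp_ctx *expected)
{
    struct tramp_ctx now;

    if (pread(fd, &now, sizeof(now), off) != (ssize_t)sizeof(now))
        return -1;
    return memcmp(&now, expected, sizeof(now)) == 0;
}

/* Demo with a temp file standing in for the trampfd descriptor. */
static int ctx_demo(void)
{
    char path[] = "/tmp/ctx-demo-XXXXXX";
    const struct tramp_ctx good = { 0x401000, { 1, 2 } };
    struct tramp_ctx bad = good;
    int fd, before, after;

    fd = mkstemp(path);
    assert(fd >= 0);
    unlink(path);

    assert(ctx_set(fd, 0, &good) == 0);
    before = ctx_verify(fd, 0, &good);   /* untampered context */

    bad.target_pc = 0xdeadbeef;          /* simulate tampering during step 2 */
    assert(ctx_set(fd, 0, &bad) == 0);
    after = ctx_verify(fd, 0, &good);    /* check should now fail */

    close(fd);
    return before == 1 && after == 0;
}
```

The check is only as strong as the trusted copy it compares against, so that copy would itself need to live somewhere an attacker cannot write (for example in read-only data), and the window between 4a and 4b remains open to the race concern raised in the quoted review.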
On 8/3/20 11:57 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 03 August 2020 17:03
>>
>> On 8/3/20 3:27 AM, David Laight wrote:
>>> From: Mark Rutland
>>>> Sent: 31 July 2020 19:32
>>> ...
>>>>> It requires PC-relative data references. I have not worked on all architectures.
>>>>> So, I need to study this. But do all ISAs support PC-relative data references?
>>>> Not all do, but pretty much any recent ISA will as it's a practical
>>>> necessity for fast position-independent code.
>>> i386 has neither PC-relative addressing nor moves from %pc.
>>> The cpu architecture knows that the sequence:
>>>	call	1f
>>> 1:	pop	%reg
>>> is used to get the %pc value so is treated specially so that
>>> it doesn't 'trash' the return stack.
>>>
>>> So PIC code isn't too bad, but you have to use the correct
>>> sequence.
>> Is that true only for 32-bit systems only? I thought RIP-relative addressing was
>> introduced in 64-bit mode. Please confirm.
> I said i386 not amd64 or x86-64.

I am sorry. My bad.

>
> So yes, 64bit code has PC-relative addressing.
> But I'm pretty sure it has no other way to get the PC itself
> except using call - certainly nothing in the 'usual' instructions.

OK.

Madhavan
On 7/31/20 1:31 PM, Mark Rutland wrote:
> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>> <madvenka@linux.microsoft.com> wrote:
>> Dealing with multiple architectures
>> -----------------------------------------------
>>
>> One good reason to use trampfd is multiple architecture support. The
>> trampoline table in a code page approach is neat. I don't deny that at
>> all. But my question is - can it be used in all cases?
>>
>> It requires PC-relative data references. I have not worked on all architectures.
>> So, I need to study this. But do all ISAs support PC-relative data references?
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

So, two questions:

1. IIUC, for position independent code, we need PC-relative control
   transfers. I know that PC-relative control transfers are kinda
   fundamental. So, I expect most architectures support it. But to
   implement the trampoline table suggestion, we need PC-relative data
   references. Like:

        movq    X(%rip), %rax

2. Do you know which architectures do not support PC-relative data
   references? I am going to study this. But if you have some information,
   I would appreciate it.

In any case, I think we should support all of the architectures on which
Linux currently runs even if they are legacy.

>
>> Even in an ISA that supports it, there would be a maximum supported offset
>> from the current PC that can be reached for a data reference. That maximum
>> needs to be at least the size of a base page in the architecture. This is because
>> the code page and the data page need to be separate for security reasons.
>> Do all ISAs support a sufficiently large offset?
> ISAs with pc-relative addressing can usually generate PC-relative
> addresses into a GPR, from which they can apply an arbitrarily large
> offset.

I will study this. I need to nail down the list of architectures that
cannot do this.

>
>> When the kernel generates the code for a trampoline, it can hard code data values
>> in the generated code itself so it does not need PC-relative data referencing.
>>
>> And, for ISAs that do support the large offset, we do have to implement and
>> maintain the code page stuff for different ISAs for each application and library
>> if we did not use trampfd.
> Trampoline code is architecture specific today, so I don't see that as a
> major issue. Common structural bits can probably be shared even if the
> specific machine code cannot.

True. But an implementor may prefer a standard mechanism provided by
the kernel so all of his architectures can be supported easily with less
effort.

If you look at the libffi reference patch I have included, the architecture
specific changes to use trampfd just involve a single C function call to
a common code function.

So, from the point of view of adoption, IMHO, the kernel provided method
is preferable.

>
> [...]
>
>> Security
>> -----------
>>
>> With the user level trampoline table approach, the data part of the trampoline table
>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>> target PC can be altered to some arbitrary location. Trampfd implements an
>> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
>> all ABI handlers used in closures for each architecture. This read-only array
>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>
>> To generalize, we can implement security rules/features if the trampoline
>> object is in the kernel.
> I don't follow this argument. If it's possible to statically define that
> in the kernel, it's also possible to do that in userspace without any
> new kernel support.

It is not statically defined in the kernel.

Let us take the libffi example. In the 64-bit X86 arch code, there are 3
ABI handlers:

    ffi_closure_unix64_sse
    ffi_closure_unix64
    ffi_closure_win64

I could create an "Allowed PCs" context like this:

    struct my_allowed_pcs {
        struct trampfd_values    pcs;
        __u64                    pc_values[3];
    };

    const struct my_allowed_pcs    my_allowed_pcs = {
        { 3, 0 },
        (uintptr_t) ffi_closure_unix64_sse,
        (uintptr_t) ffi_closure_unix64,
        (uintptr_t) ffi_closure_win64,
    };

I have created a read-only array of allowed ABI handlers that closures use.

When I set up the context for a closure trampoline, I could do this:

    pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);

This copies the array into the trampoline object in the kernel.
When the register context is set for the trampoline, the kernel checks
the PC register value against allowed PCs.

Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
permitted target PCs enforced by the kernel are the ABI handlers.

>
> [...]
>
>> Trampfd is a framework that can be used to implement multiple things. May be,
>> a few of those things can also be implemented in user land itself. But I think having
>> just one mechanism to execute dynamic code objects is preferable to having
>> multiple mechanisms not standardized across all applications.
> In abstract, having a common interface sounds nice, but in practice
> elements of this are always architecture-specific (e.g. interactions
> with HW CFI), and that common interface can result in more pain as it
> doesn't fit naturally into the context that ISAs were designed for (e.g.
> where control-flow instructions are extended with new semantics).

In the case of trampfd, the code generation is indeed architecture
specific. But that is in the kernel. The application is not affected by it.

Again, referring to the libffi reference patch, I have defined wrapper
functions for trampfd in common code. The architecture specific code
in libffi only calls the set_context function defined in common code.
Even this is required only because register names are specific to each
architecture and the target PC (to the ABI handler) is specific to
each architecture-ABI combo.

> It also means that you can't share the rough approach across OSs which
> do not implement an identical mechanism, so for code abstracting by ISA
> first, then by platform/ABI, there isn't much saving.

Why can you not share the same approach across OSes? In fact,
I have tried to design it so that other OSes can use the same
mechanism.

The only thing is that I have defined the API to be based on a file
descriptor since that is what is generally preferred by the Linux community
for a new API. If I were to implement it as a regular system call, the same
system call can be implemented in other OSes as well.

Thanks.

Madhavan
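The "Allowed PCs" rule described in this message boils down to a membership test of the requested target PC against a fixed list. The sketch below is only a userspace model of that test, not the kernel code from the series: handler_a, handler_b and not_allowed are hypothetical stand-ins for the libffi ABI handlers, and pc_is_allowed is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*handler_fn)(void);

/* Hypothetical stand-ins for ABI handlers such as ffi_closure_unix64. */
static int handler_a(void) { return 1; }
static int handler_b(void) { return 2; }
static int not_allowed(void) { return 3; }

/* Read-only list of legitimate targets, fixed at build time. */
static const handler_fn allowed_pcs[] = { handler_a, handler_b };

/* Returns 1 if pc matches an entry in the list, 0 otherwise. */
static int pc_is_allowed(uintptr_t pc, const handler_fn *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if ((uintptr_t)list[i] == pc)
            return 1;
    return 0;
}
```

A caller would reject any register context whose PC fails the test, e.g. pc_is_allowed((uintptr_t)not_allowed, allowed_pcs, 2) yields 0. Whether the test runs in the kernel or in read-only userspace code is exactly the point under discussion in the thread.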
On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> I feel like trampfd is too poorly defined at this point to evaluate.
Point taken. It is because I wanted to start with something small
and specific and expand it in the future. So, I did not really describe the big
picture - the overall vision, future work, that sort of thing. In retrospect,
maybe I should have done that.
I will take all of the input I have received so far and all of the responses
I have given, refine the definition of trampfd and send it out. Please
review that and let me know if anything is still missing from the
definition.
Thanks.
Madhavan
On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
> On 7/31/20 1:31 PM, Mark Rutland wrote:
> > On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> >> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> >>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> >>> <madvenka@linux.microsoft.com> wrote:
>
> >> When the kernel generates the code for a trampoline, it can hard code data values
> >> in the generated code itself so it does not need PC-relative data referencing.
> >>
> >> And, for ISAs that do support the large offset, we do have to implement and
> >> maintain the code page stuff for different ISAs for each application and library
> >> if we did not use trampfd.
> > Trampoline code is architecture specific today, so I don't see that as a
> > major issue. Common structural bits can probably be shared even if the
> > specific machine code cannot.
>
> True. But an implementor may prefer a standard mechanism provided by
> the kernel so all of his architectures can be supported easily with less
> effort.
>
> If you look at the libffi reference patch I have included, the architecture
> specific changes to use trampfd just involve a single C function call to
> a common code function.

Sure but in addition to that each architecture backend had to define a
set of arguments to that. I view the C function as analogous to the
"common structural bits".

I appreciate that your patch is small today (and architectures seem to
largely align on what they need), but I don't think it's necessarily
true that things will remain so simple as architectures are extended and
their calling conventions evolve, and I also don't think it's clear that
this will work for more complex cases elsewhere.

[...]

> >> With the user level trampoline table approach, the data part of the trampoline table
> >> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> >> target PC can be altered to some arbitrary location. Trampfd implements an
> >> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
> >> all ABI handlers used in closures for each architecture. This read-only array
> >> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> >>
> >> To generalize, we can implement security rules/features if the trampoline
> >> object is in the kernel.
> > I don't follow this argument. If it's possible to statically define that
> > in the kernel, it's also possible to do that in userspace without any
> > new kernel support.
> It is not statically defined in the kernel.
>
> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
> ABI handlers:
>
>     ffi_closure_unix64_sse
>     ffi_closure_unix64
>     ffi_closure_win64
>
> I could create an "Allowed PCs" context like this:
>
>     struct my_allowed_pcs {
>         struct trampfd_values    pcs;
>         __u64                    pc_values[3];
>     };
>
>     const struct my_allowed_pcs    my_allowed_pcs = {
>         { 3, 0 },
>         (uintptr_t) ffi_closure_unix64_sse,
>         (uintptr_t) ffi_closure_unix64,
>         (uintptr_t) ffi_closure_win64,
>     };
>
> I have created a read-only array of allowed ABI handlers that closures use.
>
> When I set up the context for a closure trampoline, I could do this:
>
>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>
> This copies the array into the trampoline object in the kernel.
> When the register context is set for the trampoline, the kernel checks
> the PC register value against allowed PCs.
>
> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
> permitted target PCs enforced by the kernel are the ABI handlers.

Sorry, when I said "statically define" I meant when you knew legitimate
targets ahead of time when you create the trampoline (i.e. whether you
could enumerate those and know they would not change dynamically).

My point was that you can achieve the same in userspace if the
trampoline and array of legitimate targets are in read-only memory,
without having to trap to the kernel.

I think the key point here is that an adversary must be prevented from
altering a trampoline and any associated metadata, and I think that
there are ways of achieving that without having to trap into the kernel,
and without the kernel having to be intimately aware of the calling
conventions used in userspace.

[...]

> >> Trampfd is a framework that can be used to implement multiple things. May be,
> >> a few of those things can also be implemented in user land itself. But I think having
> >> just one mechanism to execute dynamic code objects is preferable to having
> >> multiple mechanisms not standardized across all applications.
> > In abstract, having a common interface sounds nice, but in practice
> > elements of this are always architecture-specific (e.g. interactions
> > with HW CFI), and that common interface can result in more pain as it
> > doesn't fit naturally into the context that ISAs were designed for (e.g.
> > where control-flow instructions are extended with new semantics).
>
> In the case of trampfd, the code generation is indeed architecture
> specific. But that is in the kernel. The application is not affected by it.

As an ABI detail, applications are *definitely* affected by this, and it
is wrong to suggest they are not even if you don't have a specific case
in mind today. As this forms a contract between userspace and the kernel
it's overly simplistic to say that it's the kernel's problem.

For example, in the case of BTI on arm64, what should the trampoline set
PSTATE.BTYPE to? Different use-cases *will* want different values, and
not necessarily the value of PSTATE at the instant the call to the
trampoline was made. In the case of libffi specifically using the
original value of PSTATE.BTYPE probably is sound, but other code
sequences may need to restrict/broaden or entirely change that.

> Again, referring to the libffi reference patch, I have defined wrapper
> functions for trampfd in common code. The architecture specific code
> in libffi only calls the set_context function defined in common code.
> Even this is required only because register names are specific to each
> architecture and the target PC (to the ABI handler) is specific to
> each architecture-ABI combo.
>
> > It also means that you can't share the rough approach across OSs which
> > do not implement an identical mechanism, so for code abstracting by ISA
> > first, then by platform/ABI, there isn't much saving.
>
> Why can you not share the same approach across OSes? In fact,
> I have tried to design it so that other OSes can use the same
> mechanism.

Sure, but where they *don't*, you must fall back to the existing
purely-userspace mechanisms, and so a codebase now has the burden of
maintaining two distinct mechanisms.

Whereas if there's a way of doing this in userspace with (stronger)
enforcement of memory permissions the trampoline code can be common for
when this is present or absent, which is much easier for a codebase to
maintain, and could make use of weaker existing mechanisms to improve
the situation on systems without the new functionality.

Thanks,
Mark.
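The userspace alternative argued for above, where the array of legitimate targets lives in read-only memory, might be sketched as follows. This is a rough illustration with invented names (struct tramp_table, table_seal, table_call, table_demo); it uses plain mmap() and mprotect(), so it does not by itself stop an attacker who can issue an mprotect() call of their own, which is where the "(stronger) enforcement of memory permissions" mentioned in the message would come in.

```c
#define _DEFAULT_SOURCE        /* MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef long (*target_fn)(long);

#define TABLE_MAX 8

/* Hypothetical table of legitimate trampoline targets. */
struct tramp_table {
    size_t count;
    target_fn targets[TABLE_MAX];
};

static long twice(long x) { return 2 * x; }
static long negate(long x) { return -x; }

/* Build the table in a private page, then drop write permission. */
static const struct tramp_table *table_seal(const target_fn *fns, size_t n)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    struct tramp_table *t;

    if (n > TABLE_MAX)
        return NULL;
    t = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (t == MAP_FAILED)
        return NULL;
    t->count = n;
    memcpy(t->targets, fns, n * sizeof(*fns));
    if (mprotect(t, len, PROT_READ) != 0) {
        munmap(t, len);
        return NULL;
    }
    return t;
}

/* Dispatch by index; the target address comes only from the sealed page. */
static long table_call(const struct tramp_table *t, size_t idx, long arg)
{
    assert(idx < t->count);
    return t->targets[idx](arg);
}

static long table_demo(void)
{
    const target_fn fns[] = { twice, negate };
    const struct tramp_table *t = table_seal(fns, 2);

    assert(t != NULL);
    return table_call(t, 0, 21) + table_call(t, 1, 0);
}
```

Since callers pass only an index and the function pointers are read from the sealed page, a corrupted data buffer can at worst select a different legitimate target, which is the same guarantee the "Allowed PCs" context aims for without a trip into the kernel.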
On Mon, Aug 03, 2020 at 11:57:57AM -0500, Madhavan T. Venkataraman wrote: > Responses inline.. > > On 7/31/20 1:09 PM, Mark Rutland wrote: > > Hi, > > > > On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote: > >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com> > >> Trampoline code is placed either in a data page or in a stack page. In > >> order to execute a trampoline, the page it resides in needs to be mapped > >> with execute permissions. Writable pages with execute permissions provide > >> an attack surface for hackers. Attackers can use this to inject malicious > >> code, modify existing code or do other harm. > > For the purpose of below, IIUC this assumes the adversary has an > > arbitrary write. > > > >> To mitigate this, LSMs such as SELinux may not allow pages to have both > >> write and execute permissions. This prevents trampolines from executing > >> and blocks applications that use trampolines. To allow genuine applications > >> to run, exceptions have to be made for them (by setting execmem, etc). > >> In this case, the attack surface is just the pages of such applications. > >> > >> An application that is not allowed to have writable executable pages > >> may try to load trampoline code into a file and map the file with execute > >> permissions. In this case, the attack surface is just the buffer that > >> contains trampoline code. However, a successful exploit may provide the > >> hacker with means to load his own code in a file, map it and execute it. > > It's not clear to me what power the adversary is assumed to have here, > > and consequently it's not clear to me how the proposal mitigates this. > > > > For example, if the attack can control the arguments to syscalls, and > > has an arbitrary write as above, what prevents them from creating a > > trampfd of their own? > > That is the point. 
If a process is allowed to have pages that are both > writable and executable, a hacker can exploit some vulnerability such > as buffer overflow to write his own code into a page and somehow > contrive to execute that. I understood that, and that was not my question. > So, the context is - if security settings in a system disallow a page to have > both write and execute permissions, how do you allow the execution of > genuine trampolines that are runtime generated and placed in a data > page or a stack page? There are options today, e.g. a) If the restriction is only per-alias, you can have distinct aliases where one is writable and another is executable, and you can make it hard to find the relationship between the two. b) If the restriction is only temporal, you can write instructions into an RW- buffer, transition the buffer to R--, verify the buffer contents, then transition it to --X. c) You can have two processes A and B where A generates instrucitons into a buffer that (only) B can execute (where B may be restricted from making syscalls like write, mprotect, etc). If (as this series appears to) you assume that an adversary can't control the arguments trampfd_create() and any such call is legitimate, then something like (b) is not weaker and can be much more general without many of the potential ABI or performance problems of trying to fiddle with precedure call details in the kernel. If that's not an assumption, then I'm missing how you expect to determine that a trampfd_create() call is legitimate, and why that could not be applied to other calls. [...] > Could not agree with you more. > > > > [...] > > > >> Trampoline File Descriptor (trampfd) > >> -------------------------- > >> > >> I am proposing a kernel API using anonymous file descriptors that > >> can be used to create and execute trampolines with the help of the > >> kernel. In this solution also, the kernel does the work of the trampoline. 
> > What's the rationale for the kernel emulating the trampoline here? > > > > In the case of EMUTRAMP this was necessary to work with existing > > application binaries and kernel ABIs which placed instructions onto the > > stack, and the stack needed to remain RW for other reasons. That > > restriction doesn't apply here. > > In addition to the stack, EMUTRAMP also allows the emulation > of the same well-known trampolines placed in a non-stack data page. > For instance, libffi closures embed a trampoline in a closure structure. > That gets executed when the caller of libffi invokes it. > > The goal of EMUTRAMP is to allow safe trampolines to execute when > security settings disallow their execution. Mainly, it permits applications > that use libffi to run. A lot of applications use libffi. > > They chose the emulation method so that no changes need to be made > to application code to use them. But the EMUTRAMP implementors note > in their description that the real solution to the problem is a kernel > API that is backed by a safe code generator. > > trampfd is an attempt to define such an API. This is just a starting point. > I realize that we need to have a lot of discussion to refine the approach. > > > Assuming trampfd creation is somehow authenticated, the code could be > > placed in a r-x page (which the kernel could refuse to add write > > permission), in order to prevent modification. If that's sufficient, > > it's not much of a leap to allow userspace to generate the code. > > IIUC, you are suggesting that the user hands the kernel a code fragment > and requests it to be placed in an r-x page, correct? However, the > kernel cannot trust any code given to it by the user. Nor can it scan any > piece of code and reliably decide if it is safe or not. Per that same logic the kernel cannot trust trampfd creation calls to be legitimate as the adversary could mess with the arguments.
It doesn't matter if the kernel's codegen is trustworthy if it's potentially driven by an adversary. > So, the problem of executing dynamic code when security settings are > restrictive cannot be solved in userland. The only option I can think of is > to have the kernel provide support for dynamic code. It must have one > or more safe, trusted code generation components and an API to use > the components. > > My goal is to introduce an API and start off by supporting simple, regular > trampolines that are widely used. Then, evolve the feature over a period > of time to include other forms of dynamic code such as JIT code. I think that you're making a leap to this approach without sufficient justification that it actually solves the problem, and I believe that there will be ABI issues with this approach which can be sidestepped by other potential approaches. Taking a step back, I think it's necessary to better describe the problem and constraints that you believe apply before attempting to justify any potential solution. [...] > >> The kernel creates the trampoline mapping without any permissions. When > >> the trampoline is executed by user code, a page fault happens and the > >> kernel gets control. The kernel recognizes that this is a trampoline > >> invocation. It sets up the user registers based on the specified > >> register context, and/or pushes values on the user stack based on the > >> specified stack context, and sets the user PC to the requested target > >> PC. When the kernel returns, execution continues at the target PC. > >> So, the kernel does the work of the trampoline on behalf of the > >> application. > >> > >> In this case, the attack surface is the context buffer. A hacker may > >> attack an application with a vulnerability and may be able to modify the > >> context buffer. So, when the register or stack context is set for > >> a trampoline, the values may have been tampered with. 
From an attack > >> surface perspective, this is similar to Trampoline Emulation. But > >> with trampfd, user code can retrieve a trampoline's context from the > >> kernel and add defensive checks to see if the context has been > >> tampered with. > > Can you elaborate on this: what sort of checks would be applied, and > > how? > > So, an application that uses trampfd would do the following steps: > > 1. Create a trampoline by calling trampfd_create() > 2. Set the register and/or stack contexts for the trampoline. > 3. mmap() the trampoline to get an address > 4. Invoke the trampoline using the address > > Let us say that the application has a vulnerability such as buffer overflow > that allows a hacker to modify the data that is used to do step 2. > > Potentially, a hacker could modify the following things: > - register values specified in the register context > - values specified in the stack context > - the target PC specified in the register context > > When the trampoline is invoked in step 4, the kernel will gain control, > load the registers, push stuff on the stack and transfer control to the target > PC. Whatever the hacker had modified in step 2 will take effect in step 4. > His values will get loaded and his PC is the one that will get control. > > A paranoid application could add a step to this sequence. So, the steps > would be: > > 1. Create a trampoline by calling trampfd_create() > 2. Set the register and/or stack contexts for the trampoline. > 3. mmap() the trampoline to get an address > 4a. Retrieve the register and stack context for the trampoline from the > kernel and check if anything has been altered. If yes, abort. > 4b. Invoke the trampoline using the address As above, you can also do this when using mprotect today, transitioning the buffer RWX -> R-- -> R-X. If you're worried about subsequent modification via an alias, a sealed memfd would work assuming that can be mapped R-X. 
This approach is applicable to trampfd, but it isn't a specific benefit of trampfd. [...] > >> - In the future, if the kernel can be enhanced to use a safe code > >> generation component, that code can be placed in the trampoline mapping > >> pages. Then, the trampoline invocation does not have to incur a trip > >> into the kernel. > >> > >> - Also, if the kernel can be enhanced to use a safe code generation > >> component, other forms of dynamic code such as JIT code can be > >> addressed by the trampfd framework. > > I don't see why it's necessary for the kernel to generate code at all. > > If the trampfd creation requests can be trusted, what prevents trusting > > a sealed set of instructions generated in userspace? > > Let us consider a system in which: > - a process is not permitted to have pages with both write and execute > - a process is not permitted to map any file as executable unless it > is properly signed. In other words, cryptographically verified. > > Then, the process cannot execute any code that is runtime generated. > That includes trampolines. Only trampoline code that is part of program > text at build time would be permitted to execute. > > In this scenario, trampfd requests are coming from signed code. So, they > are trusted by the kernel. But trampoline code could be dynamically generated. > The kernel will not trust it. I think this a very hand-wavy argument, as it suggests that generated code is not trusted, but what is effectively a generated bytecode is. If certain codegen can be trusted, then we can add mechanisms to permit the results of this to be mapped r-x. If that is not possible, then the same argument says that trampfd requests cannot be trusted. Thanks, Mark.
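The temporal sequence Mark describes (write into an RW- buffer, freeze it to R--, verify, then make it executable) can be sketched in userspace roughly as below. This is only an illustration under stated assumptions: `publish_code()` is a hypothetical helper name, the "verification" is a plain byte comparison against a caller-supplied reference rather than a real code verifier, and the final state is R-X as in Mark's later remark. On a system whose policy forbids making once-written memory executable, the last mprotect() is expected to fail and the helper returns NULL.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes of code into a fresh anonymous
 * mapping, drop write permission, verify the bytes, and only then add
 * execute permission. The mapping is never writable and executable at
 * the same time. Returns the R-X mapping, or NULL on any failure.
 */
void *publish_code(const void *code, size_t len)
{
    assert(code != NULL && len > 0);

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t psz = (size_t)page;
    size_t map_len = (len + psz - 1) / psz * psz;   /* round up to pages */

    unsigned char *buf = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);                          /* RW-: generate */

    if (mprotect(buf, map_len, PROT_READ) != 0)      /* R--: freeze */
        goto fail;

    if (memcmp(buf, code, len) != 0)                 /* verify while frozen */
        goto fail;

    if (mprotect(buf, map_len, PROT_READ | PROT_EXEC) != 0)  /* R-X */
        goto fail;

    return buf;

fail:
    munmap(buf, map_len);
    return NULL;
}
```

The sketch never jumps into the buffer, so it does not depend on the instruction set. Whether a given LSM policy permits the final transition is exactly the point under dispute in this thread.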
> > If you look at the libffi reference patch I have included, the architecture
> > specific changes to use trampfd just involve a single C function call to
> > a common code function.

No idea what libffi is, but it must surely be simpler to rewrite it to avoid nested function definitions.

Or find a book from the 1960s on how to do recursive calls and nested functions in FORTRAN-IV.

David
> > > If you look at the libffi reference patch I have included, the architecture
> > > specific changes to use trampfd just involve a single C function call to
> > > a common code function.
>
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.
>
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.

FWIW it is probably as simple as:
1) Put all the 'variables' the nested function accesses into a struct.
2) Add a field for the address of the 'nested' function.
3) Pass the address of the structure down instead of the address of the function.

If you aren't in control of the call sites then add the structure to a linked list on a thread-local variable.

David
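A minimal sketch of the three steps David lists, with no executable trampoline involved. All names here (`struct closure`, `add_base`, `apply_twice`, `make_adder`) are made up for illustration and do not come from GCC or libffi.

```c
#include <assert.h>
#include <stddef.h>

/* Steps 1 and 2: the captured variables plus the address of the function. */
struct closure {
    int (*fn)(struct closure *self, int arg);
    int base;    /* captured variable, read by the "nested" function */
    int calls;   /* captured variable, modified by the "nested" function */
};

/* The would-be nested function: it reaches its variables through `self`. */
static int add_base(struct closure *self, int arg)
{
    self->calls++;
    return self->base + arg;
}

/* Step 3: the callee takes the structure instead of a bare function pointer. */
int apply_twice(struct closure *c, int x)
{
    assert(c != NULL && c->fn != NULL);
    return c->fn(c, c->fn(c, x));
}

/* Build a closure that adds `base` to its argument. */
struct closure make_adder(int base)
{
    struct closure c = { .fn = add_base, .base = base, .calls = 0 };
    return c;
}
```

For example, `struct closure c = make_adder(10); apply_twice(&c, 1);` evaluates `add_base` twice through the structure. The cost of this design is that every call site must be changed to pass the structure, which is why David suggests a thread-local list when the call sites are not under your control.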
On 8/4/20 9:33 AM, David Laight wrote:
>>> If you look at the libffi reference patch I have included, the architecture
>>> specific changes to use trampfd just involve a single C function call to
>>> a common code function.
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.

Sorry if I wasn't clear. libffi is one use case and GCC nested functions are another. libffi is not used to implement nested functions. For nested functions, GCC generates trampoline code and arranges to place it on the stack and execute it.

I agree with your other points about nested function implementation.

Madhavan

> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.
>
> David
Hey Mark, I am working on putting together an improved definition of trampfd per Andy's comment. I will try to address your comments in that improved definition. Once I send that out, I will respond to your emails as well. Thanks. Madhavan On 8/4/20 8:55 AM, Mark Rutland wrote: > On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote: >> On 7/31/20 1:31 PM, Mark Rutland wrote: >>> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote: >>>> On 7/30/20 3:54 PM, Andy Lutomirski wrote: >>>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman >>>>> <madvenka@linux.microsoft.com> wrote: >>>>> When the kernel generates the code for a trampoline, it can hard code data values >>>> in the generated code itself so it does not need PC-relative data referencing. >>>> >>>> And, for ISAs that do support the large offset, we do have to implement and >>>> maintain the code page stuff for different ISAs for each application and library >>>> if we did not use trampfd. >>> Trampoline code is architecture specific today, so I don't see that as a >>> major issue. Common structural bits can probably be shared even if the >>> specific machine code cannot. >> True. But an implementor may prefer a standard mechanism provided by >> the kernel so all of his architectures can be supported easily with less >> effort. >> >> If you look at the libffi reference patch I have included, the architecture >> specific changes to use trampfd just involve a single C function call to >> a common code function. > Sure but in addition to that each architecture backend had to define a > set of arguments to that. I view the C function as analogous to the > "common structural bits".
> > I appreciate that your patch is small today (and architectures seem to > largely align on what they need), but I don't think it's necessarily > true that things will remain so simple as architecture are extended and > their calling conventions evolve, and I also don't think it's clear that > this will work for more complex cases elsewhere. > > [...] > >>>> With the user level trampoline table approach, the data part of the trampoline table >>>> can be hacked by an attacker if an application has a vulnerability. Specifically, the >>>> target PC can be altered to some arbitrary location. Trampfd implements an >>>> "Allowed PCS" context. In the libffi changes, I have created a read-only array of >>>> all ABI handlers used in closures for each architecture. This read-only array >>>> can be used to restrict the PC values for libffi trampolines to prevent hacking. >>>> >>>> To generalize, we can implement security rules/features if the trampoline >>>> object is in the kernel. >>> I don't follow this argument. If it's possible to statically define that >>> in the kernel, it's also possible to do that in userspace without any >>> new kernel support. >> It is not statically defined in the kernel. >> >> Let us take the libffi example. In the 64-bit X86 arch code, there are 3 >> ABI handlers: >> >> ffi_closure_unix64_sse >> ffi_closure_unix64 >> ffi_closure_win64 >> >> I could create an "Allowed PCs" context like this: >> >> struct my_allowed_pcs { >> struct trampfd_values pcs; >> __u64 pc_values[3]; >> }; >> >> const struct my_allowed_pcs my_allowed_pcs = { >> { 3, 0 }, >> (uintptr_t) ffi_closure_unix64_sse, >> (uintptr_t) ffi_closure_unix64, >> (uintptr_t) ffi_closure_win64, >> }; >> >> I have created a read-only array of allowed ABI handlers that closures use. 
>> >> When I set up the context for a closure trampoline, I could do this: >> >> pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET); >> >> This copies the array into the trampoline object in the kernel. >> When the register context is set for the trampoline, the kernel checks >> the PC register value against allowed PCs. >> >> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only >> permitted target PCs enforced by the kernel are the ABI handlers. > Sorry, when I said "statically define" I meant when you knew legitimate > targets ahead of time when you create the trampoline (i.e. whether you > could enumerate those and know they would not change dynamically). > > My point was that you can achieve the same in userspace if the > trampoline and array of legitimate targets are in read-only memory, > without having to trap to the kernel. > > I think the key point here is that an adversary must be prevented from > altering a trampoline and any associated metadata, and I think that > there are ways of achieving that without having to trap into the kernel, > and without the kernel having to be intimately aware of the calling > conventions used in userspace. > > [...] > >>>> Trampfd is a framework that can be used to implement multiple things. Maybe, >>>> a few of those things can also be implemented in user land itself. But I think having >>>> just one mechanism to execute dynamic code objects is preferable to having >>>> multiple mechanisms not standardized across all applications. >>> In abstract, having a common interface sounds nice, but in practice >>> elements of this are always architecture-specific (e.g. interactions >>> with HW CFI), and that common interface can result in more pain as it >>> doesn't fit naturally into the context that ISAs were designed for (e.g. >>> where control-flow instructions are extended with new semantics).
>> In the case of trampfd, the code generation is indeed architecture >> specific. But that is in the kernel. The application is not affected by it. > As an ABI detail, applications are *definitely* affected by this, and it > is wrong to suggest they are not even if you don't have a specific case > in mind today. As this forms a contract between userspace and the kernel > it's overly simplistic to say that it's the kernel's problem. > > For example, in the case of BTI on arm64, what should the trampoline > set PSTATE.BTYPE to? Different use-cases *will* want different values, > and not necessarily the value of PSTATE at the instant the call to the > trampoline was made. In the case of libffi specifically using the > original value of PSTATE.BTYPE probably is sound, but other code > sequences may need to restrict/broaden or entirely change that. > >> Again, referring to the libffi reference patch, I have defined wrapper >> functions for trampfd in common code. The architecture specific code >> in libffi only calls the set_context function defined in common code. >> Even this is required only because register names are specific to each >> architecture and the target PC (to the ABI handler) is specific to >> each architecture-ABI combo. >> >>> It also means that you can't share the rough approach across OSs which >>> do not implement an identical mechanism, so for code abstracting by ISA >>> first, then by platform/ABI, there isn't much saving. >> Why can you not share the same approach across OSes? In fact, >> I have tried to design it so that other OSes can use the same >> mechanism. > Sure, but where they *don't*, you must fall back to the existing > purely-userspace mechanisms, and so a codebase now has the burden of > maintaining two distinct mechanisms.
> > Whereas if there's a way of doing this in userspace with (stronger) > enforcement of memory permissions the trampoline code can be common for > when this is present or absent, which is much easier for a codebase to > maintain, and could make use of weaker existing mechanisms to improve > the situation on systems without the new functionality. > > Thanks, > Mark.
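Mark's remark that the same target check can be done in userspace when the list of legitimate targets sits in read-only memory might look like the sketch below. The handler names, `target_is_allowed()` and `checked_dispatch()` are hypothetical, and the sketch only shows the allow-list check itself. It does not show how the trampoline and its metadata would be protected from modification, which is the harder part of the argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*handler_fn)(void);

/* Stand-ins for ABI handlers; each has a distinct body on purpose. */
int calls_a, calls_b, calls_other;
void handler_a(void)     { calls_a++; }
void handler_b(void)     { calls_b++; }
void not_a_handler(void) { calls_other++; }

/*
 * A const array of const pointers. Typical toolchains place this in a
 * read-only section (after relocation, for position-independent code),
 * so a stray write cannot add an entry.
 */
static const handler_fn allowed_targets[] = { handler_a, handler_b };

/* Return true only if `target` is one of the enumerated handlers. */
bool target_is_allowed(handler_fn target)
{
    size_t n = sizeof allowed_targets / sizeof allowed_targets[0];
    for (size_t i = 0; i < n; i++) {
        if (allowed_targets[i] == target)
            return true;
    }
    return false;
}

/* Refuse to transfer control to anything outside the allow-list. */
bool checked_dispatch(handler_fn target)
{
    if (!target_is_allowed(target))
        return false;
    target();
    return true;
}
```

Here `checked_dispatch(handler_a)` runs the handler, while `checked_dispatch(not_a_handler)` returns false without calling it.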
Thanks for the lively discussion. I have tried to answer some of the comments below. On 8/4/20 9:30 AM, Mark Rutland wrote: > >> So, the context is - if security settings in a system disallow a page to have >> both write and execute permissions, how do you allow the execution of >> genuine trampolines that are runtime generated and placed in a data >> page or a stack page? > There are options today, e.g. > > a) If the restriction is only per-alias, you can have distinct aliases > where one is writable and another is executable, and you can make it > hard to find the relationship between the two. > > b) If the restriction is only temporal, you can write instructions into > an RW- buffer, transition the buffer to R--, verify the buffer > contents, then transition it to --X. > > c) You can have two processes A and B where A generates instructions into > a buffer that (only) B can execute (where B may be restricted from > making syscalls like write, mprotect, etc). The general principle of the mitigation is W^X. I would argue that the above options are violations of the W^X principle. If they are allowed today, they must be fixed. And they will be. So, we cannot rely on them. a) This requires a remap operation. Two mappings point to the same physical page. One mapping has W and the other one has X. This is a violation of W^X. b) This is again a violation. The kernel should refuse to give execute permission to a page that was writeable in the past and refuse to give write permission to a page that was executable in the past. c) This is just a variation of (a). In general, the problem with user-level methods to map and execute dynamic code is that the kernel cannot tell if a genuine application is using them or an attacker is using them or piggy-backing on them. If a security subsystem blocks all user-level methods for this reason, we need a kernel mechanism to deal with the problem. The kernel mechanism is not to be a backdoor.
It is there to define ways in which safe dynamic code can be executed. I admit I have to provide more proof that my API and framework can cover different cases. So, that is what I am doing now. I am in the process of identifying other examples (per Andy's comment) and attempting to show that this API and framework can address them. It will take a little time. >> >> IIUC, you are suggesting that the user hands the kernel a code fragment >> and requests it to be placed in an r-x page, correct? However, the >> kernel cannot trust any code given to it by the user. Nor can it scan any >> piece of code and reliably decide if it is safe or not. > Per that same logic the kernel cannot trust trampfd creation calls to be > legitimate as the adversary could mess with the arguments. It doesn't > matter if the kernel's codegen is trustworthy if it's potentially driven > by an adversary. That is not true. IMO, this is not a deficiency in trampfd. This is something that is there even for regular system calls. For instance, the write() system call will faithfully write out a buffer to a file even if the buffer contents have been hacked by an attacker. A system call can perform certain checks on incoming arguments. But it cannot tell if a hacker has modified them. So, there are two aspects in dynamic code that I am considering - data and code. I submit that the data part can be hacked if an application has a vulnerability such as buffer overflow. I don't see how we can ever help that. So, I am focused on the code generation part. Not all dynamic code is the same. They have different degrees of trust. Off the top of my head, I have tried to identify some examples where we can have more trust on dynamic code and have the kernel permit its execution. 1. If the kernel can do the job, then that is one safe way. Here, the kernel is the code. There is no code generation involved. This is what I have presented in the patch series as the first cut. 2. 
If the kernel can generate the code, then that code has a measure of trust. For trampolines, I agreed to do this for performance. 3. If the code resides in a signed file, then we know that it comes from an known source and it was generated at build time. So, it is not hacker generated. So, there is a measure of trust. This is not just program text. This could also be a buffer that contains trampoline code that resides in the read-only data section of a binary. 4. If the code resides in a signed file and is emulated (e.g. by QEMU) and we generate code for dynamic binary translation, we should be able to do that provided the code generator itself is not suspect. See the next point. 5. The above are examples of actual machine code or equivalent. We could also have source code from which we generate machine code. E.g., JIT code from Java byte code. In this case, if the source code is in a signed file, we have a measure of trust on the source. If the kernel uses its own trusted code generator to generate the object code from the source code, then that object code has a measure of trust. Anyway, these are just examples. The principle is - if we can identify dynamic code that has a certain measure of trust, can the kernel permit their execution? All other code that cannot really be trusted by the kernel cannot be executed safely (unless we find some safe and efficient way to sandbox such code and limit the effects of the code to within the sandbox). This is outside the scope of what I am doing. >> So, the problem of executing dynamic code when security settings are >> restrictive cannot be solved in userland. The only option I can think of is >> to have the kernel provide support for dynamic code. It must have one >> or more safe, trusted code generation components and an API to use >> the components. >> >> My goal is to introduce an API and start off by supporting simple, regular >> trampolines that are widely used. 
Then, evolve the feature over a period >> of time to include other forms of dynamic code such as JIT code. > I think that you're making a leap to this approach without sufficient > justification that it actually solves the problem, and I believe that > there will be ABI issues with this approach which can be sidestepped by > other potential approaches. > > Taking a step back, I think it's necessary to better describe the > problem and constraints that you believe apply before attempting to > justify any potential solution. I totally agree that more justification is needed and I am working on it. As I have mentioned above, I intend to have the kernel generate code only if the code generation is simple enough. For more complicated cases, I plan to use a user-level code generator that is for exclusive kernel use. I have yet to work out the details on how this would work. Need time. > > [...] > >> >> 1. Create a trampoline by calling trampfd_create() >> 2. Set the register and/or stack contexts for the trampoline. >> 3. mmap() the trampoline to get an address >> 4a. Retrieve the register and stack context for the trampoline from the >> kernel and check if anything has been altered. If yes, abort. >> 4b. Invoke the trampoline using the address > As above, you can also do this when using mprotect today, transitioning > the buffer RWX -> R-- -> R-X. If you're worried about subsequent > modification via an alias, a sealed memfd would work assuming that can > be mapped R-X. This is a violation of W^X and the security subsystem must be fixed if it permits it. > This approach is applicable to trampfd, but it isn't a specific benefit > of trampfd. > > [...] > >>>> - In the future, if the kernel can be enhanced to use a safe code >>>> generation component, that code can be placed in the trampoline mapping >>>> pages. Then, the trampoline invocation does not have to incur a trip >>>> into the kernel. 
>>>> >>>> - Also, if the kernel can be enhanced to use a safe code generation >>>> component, other forms of dynamic code such as JIT code can be >>>> addressed by the trampfd framework. >>> I don't see why it's necessary for the kernel to generate code at all. >>> If the trampfd creation requests can be trusted, what prevents trusting >>> a sealed set of instructions generated in userspace? >> Let us consider a system in which: >> - a process is not permitted to have pages with both write and execute >> - a process is not permitted to map any file as executable unless it >> is properly signed. In other words, cryptographically verified. >> >> Then, the process cannot execute any code that is runtime generated. >> That includes trampolines. Only trampoline code that is part of program >> text at build time would be permitted to execute. >> >> In this scenario, trampfd requests are coming from signed code. So, they >> are trusted by the kernel. But trampoline code could be dynamically generated. >> The kernel will not trust it. > I think this a very hand-wavy argument, as it suggests that generated > code is not trusted, but what is effectively a generated bytecode is. > If certain codegen can be trusted, then we can add mechanisms to permit > the results of this to be mapped r-x. If that is not possible, then the > same argument says that trampfd requests cannot be trusted. There is certainly an extra measure of trust in code that is in signature verified files as compared to code that is generated on the fly. At least, we know that the place from which we get that code is known and the file was generated at build time and not hacker generated. Such files could still contain a vulnerability. But because these files are maintained by a known source, chances are that there is nothing malicious in them. Thanks. Madhavan
Hi!

> Thanks for the lively discussion. I have tried to answer some of the
> comments below.

> > There are options today, e.g.
> >
> > a) If the restriction is only per-alias, you can have distinct aliases
> >    where one is writable and another is executable, and you can make it
> >    hard to find the relationship between the two.
> >
> > b) If the restriction is only temporal, you can write instructions into
> >    an RW- buffer, transition the buffer to R--, verify the buffer
> >    contents, then transition it to --X.
> >
> > c) You can have two processes A and B where A generates instructions into
> >    a buffer that (only) B can execute (where B may be restricted from
> >    making syscalls like write, mprotect, etc).
>
> The general principle of the mitigation is W^X. I would argue that
> the above options are violations of the W^X principle. If they are
> allowed today, they must be fixed. And they will be. So, we cannot
> rely on them.

Would you mind describing your threat model?

Because I believe you are using a model different from everyone else.

In particular, I don't believe b) is a problem or should be fixed.

I'll add d): the application mmaps a file (R--) and uses the write syscall to change trampolines in it.

> b) This is again a violation. The kernel should refuse to give execute
>    permission to a page that was writeable in the past and refuse to
>    give write permission to a page that was executable in the past.

Why?

Pavel
Resending because of mailer problems. Some of the recipients did not receive my email. I apologize. Sigh. Here is a redefinition of trampfd based on review comments. I wanted to address dynamic code in 3 different ways: Remove the need for dynamic code where possible -------------------------------------------------------------------- If the kernel itself can perform the work of some dynamic code, then the code can be replaced by the kernel. This is what I implemented in the patchset. But reviewers objected to the performance impact. One trip to the kernel was needed for each trampoline invocation. So, I have decided to defer this approach. Convert dynamic code to static code where possible ---------------------------------------------------------------------- This is possible with help from the kernel. This has no performance impact and can be used in libffi, GCC nested functions, etc. I have described the approach below. Deal with code generation ----------------------------------- For cases like generating JIT code from Java byte code, I wanted to establish a framework. However, reviewers felt that details are missing. Should the kernel generate code or should it use a user-level code generator? How do you make sure that a user level code generator can be trusted? How would the communication work? ABI details? Architecture support? Support for different types - JIT, DBT, etc? I have come to the conclusion that this is best done separately. My main interest is to provide a way to convert dynamic code such as trampolines to static code without any special architecture support. This can be done with the kernel's help. Any code that gets written in the future can conform to this as well. So, in version 2 of the Trampfd RFC, I would like to simplify trampfd and just address item 2. I will reimplement the support in libffi and present it. 
Convert dynamic code to static code ------------------------------------------------ One problem with dynamic code is that it cannot be verified or authenticated by the kernel. The kernel cannot tell the difference between genuine dynamic code and an attacker's code. Where possible, dynamic code should be converted to static code and placed in the text segment of a binary file. This allows the kernel to verify the code by verifying the signature of the file. The other problem is using user-level methods to load and execute dynamic code can potentially be exploited by an attacker to inject his code and have it be executed. To prevent this, a system may enforce W^X. If W^X is enforced properly, genuine dynamic code will not be able to run. This is another reason to convert dynamic code to static code. The issue in converting dynamic code to static code is that the data is dynamic. The code does not know before hand where the data is going to be at runtime. Some architectures support PC-relative data references. So, if you co-locate code and data, then the code can find the data at runtime. But this is not supported on all architectures. When supported, there may be limitations to deal with. Plus you have to take the trouble to co-locate code and data. And, to deal with W^X, code and data need to be in different pages. All architectures must be supported without any limitations. Fortunately, the kernel can solve this problem quite easily. I suggest the following: Convert dynamic code to static code like this: - Decide which register should point to the data that the code needs. Call it register R. - Write the static code assuming that R already points to the data. - Use trampfd and pass the following to the kernel: - pointers to the code and data - the name of the register R The kernel will write the following instructions in a trampoline page mapped into the caller's address space with R-X. 
- Load the data address in register R - Jump to the static code Basically, the kernel provides a trampoline to jump to the user's code and returns the kernel-provided trampoline's address to the user. It is trivial to implement a trampoline table in the trampoline page to conserve memory. Issues raised previously ------------------------------- I believe that the following issues that were raised by reviewers are not a problem in this scheme. Please re-review. - Florian mentioned the libffi trampoline table. Trampoline tables can be implemented in this scheme easily. - Florian mentioned stack unwinders. I am not an expert on unwinders. But I don't see an issue with unwinders. - Mark Rutland mentioned Intel's CET and CFI. Don't see a problem there. - Mark Rutland mentioned PAC+BTI on ARM64. Don't see a problem there. If I have missed addressing any previously raised issue, I apologize. Please let me know. Thanks! Madhavan
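The conversion described in the message above can be illustrated in plain C. This is only a user-space analogy under stated assumptions, not the trampfd interface: `tramp_entry`, `tramp_invoke`, `adder` and `adder_ctx` are hypothetical names, and the `data` argument stands in for register R. In the proposed scheme, the "load R, then jump" step that `tramp_invoke` performs here would instead be done by the trampoline that the kernel writes into an R-X page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: one slot of a trampoline table.
 * It records which static code to jump to and which data to hand it. */
struct tramp_entry {
    long (*code)(void *data, long arg); /* static code in the text segment */
    void *data;                         /* dynamic data for this instance  */
};

/* Example data that differs per instance. */
struct adder_ctx {
    long increment;
};

/* Static code: written once, and it assumes that "R" (here the
 * data parameter) already points to the data it needs. */
long adder(void *data, long arg)
{
    const struct adder_ctx *ctx = data;
    return arg + ctx->increment;
}

/* Stand-in for the trampoline: load the data address, then jump
 * to the static code. */
long tramp_invoke(const struct tramp_entry *t, long arg)
{
    assert(t != NULL && t->code != NULL);
    return t->code(t->data, arg);
}
```

Two table entries that share `adder` but point at different `adder_ctx` objects behave like two distinct closures. That mirrors the trampoline table idea: the code is shared and static, and only the data pointer differs per entry.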
On 8/8/20 5:17 PM, Pavel Machek wrote: > Hi! > >> Thanks for the lively discussion. I have tried to answer some of the >> comments below. > >>> There are options today, e.g. >>> >>> a) If the restriction is only per-alias, you can have distinct aliases >>> where one is writable and another is executable, and you can make it >>> hard to find the relationship between the two. >>> >>> b) If the restriction is only temporal, you can write instructions into >>> an RW- buffer, transition the buffer to R--, verify the buffer >>> contents, then transition it to --X. >>> >>> c) You can have two processes A and B where A generates instrucitons into >>> a buffer that (only) B can execute (where B may be restricted from >>> making syscalls like write, mprotect, etc). >> >> The general principle of the mitigation is W^X. I would argue that >> the above options are violations of the W^X principle. If they are >> allowed today, they must be fixed. And they will be. So, we cannot >> rely on them. > > Would you mind describing your threat model? > > Because I believe you are using model different from everyone else. > > In particular, I don't believe b) is a problem or should be fixed. It is a problem because a kernel that implements W^X properly will not allow it. It has no idea what has been done in userland. It has no idea that the user has checked and verified the buffer contents after transitioning the page to R--. > > I'll add d), application mmaps a file(R--), and uses write syscall to change > trampolines in it. > No matter how you do it, these are all user-level methods that can be hacked. The kernel cannot be sure that an attacker's code has not found its way into the file. >> b) This is again a violation. The kernel should refuse to give execute >> permission to a page that was writeable in the past and refuse to >> give write permission to a page that was executable in the past. > > Why? I don't know about the latter part.
I guess I need to think about it. But the former is valid. When a page is RW-, a hacker could hack the page. Then it does not matter that the page is transitioned to R--. Again, the kernel cannot be sure that the user has verified the contents after R--. IMO, W^X needs to be enforced temporally as well. Madhavan
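For reference, option (b) from the quoted list (write into an RW- buffer, transition to R--, verify, transition to --X) can be sketched with standard mmap()/mprotect() calls. This is a minimal sketch under assumptions: `load_then_flip` is a hypothetical helper name, the "verification" is just a byte comparison against the caller's copy, and the bytes are never executed. It only shows the sequence of permission changes that the discussion is about.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copies len bytes into a fresh anonymous page
 * while it is RW-, drops write permission, checks the contents, and
 * only then requests execute permission. Returns the mapping on
 * success or NULL on failure. Nothing is executed here. */
void *load_then_flip(const unsigned char *code, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(code != NULL);
    if (page <= 0 || len == 0 || len > (size_t)page)
        return NULL;

    void *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);                          /* RW-: write  */

    if (mprotect(buf, (size_t)page, PROT_READ) != 0) /* RW- -> R--  */
        goto fail;
    if (memcmp(buf, code, len) != 0)                 /* R--: verify */
        goto fail;
    if (mprotect(buf, (size_t)page, PROT_EXEC) != 0) /* R-- -> --X  */
        goto fail;
    return buf;

fail:
    munmap(buf, (size_t)page);
    return NULL;
}
```

The page is never writable and executable at the same time, which is why this passes a per-mapping, point-in-time W^X check. Under the stricter temporal rule argued for above, the final mprotect() is the call a kernel would be expected to refuse, because the page was writable earlier.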
Hi! > >> Thanks for the lively discussion. I have tried to answer some of the > >> comments below. > > > >>> There are options today, e.g. > >>> > >>> a) If the restriction is only per-alias, you can have distinct aliases > >>> where one is writable and another is executable, and you can make it > >>> hard to find the relationship between the two. > >>> > >>> b) If the restriction is only temporal, you can write instructions into > >>> an RW- buffer, transition the buffer to R--, verify the buffer > >>> contents, then transition it to --X. > >>> > >>> c) You can have two processes A and B where A generates instrucitons into > >>> a buffer that (only) B can execute (where B may be restricted from > >>> making syscalls like write, mprotect, etc). > >> > >> The general principle of the mitigation is W^X. I would argue that > >> the above options are violations of the W^X principle. If they are > >> allowed today, they must be fixed. And they will be. So, we cannot > >> rely on them. > > > > Would you mind describing your threat model? > > > > Because I believe you are using model different from everyone else. > > > > In particular, I don't believe b) is a problem or should be fixed. > > It is a problem because a kernel that implements W^X properly > will not allow it. It has no idea what has been done in userland. > It has no idea that the user has checked and verified the buffer > contents after transitioning the page to R--. No, it is not a problem. W^X is designed to protect from attackers doing buffer overflows, not attackers doing arbitrary syscalls. Best regards, Pavel
On 8/11/20 8:08 AM, Pavel Machek wrote: > Hi! > >>>> Thanks for the lively discussion. I have tried to answer some of the >>>> comments below. >>> >>>>> There are options today, e.g. >>>>> >>>>> a) If the restriction is only per-alias, you can have distinct aliases >>>>> where one is writable and another is executable, and you can make it >>>>> hard to find the relationship between the two. >>>>> >>>>> b) If the restriction is only temporal, you can write instructions into >>>>> an RW- buffer, transition the buffer to R--, verify the buffer >>>>> contents, then transition it to --X. >>>>> >>>>> c) You can have two processes A and B where A generates instrucitons into >>>>> a buffer that (only) B can execute (where B may be restricted from >>>>> making syscalls like write, mprotect, etc). >>>> >>>> The general principle of the mitigation is W^X. I would argue that >>>> the above options are violations of the W^X principle. If they are >>>> allowed today, they must be fixed. And they will be. So, we cannot >>>> rely on them. >>> >>> Would you mind describing your threat model? >>> >>> Because I believe you are using model different from everyone else. >>> >>> In particular, I don't believe b) is a problem or should be fixed. >> >> It is a problem because a kernel that implements W^X properly >> will not allow it. It has no idea what has been done in userland. >> It has no idea that the user has checked and verified the buffer >> contents after transitioning the page to R--. > > No, it is not a problem. W^X is designed to protect from attackers > doing buffer overflows, not attackers doing arbitrary syscalls. > Hey Pavel, You are correct. The W^X implementation today still has some holes. IIUC, the principle of W^X is - user should not be able to (W) write code into a page and use some trick to get it to (X) execute. So, what I was trying to say was that the W^X principle is not implemented completely today. 
Mark Rutland mentioned some other tricks as well which are being used today. For instance, Microsoft has submitted this proposal: https://microsoft.github.io/ipe/ IPE is an LSM. In this proposal, only mappings that are backed by a signature verified file can have execute permissions. This means that all anonymous page based tricks will fail. And, file mapping based tricks will fail as well when temporary files are used to load code and mmap(). That is the intent. Thanks! Madhavan
I am working on version 2 of trampfd. Will send it out soon. Thanks for all the comments so far! Madhavan On 8/10/20 12:34 PM, Madhavan T. Venkataraman wrote: > [...]
On Thu, Aug 06, 2020 at 12:26:02PM -0500, Madhavan T. Venkataraman wrote: > Thanks for the lively discussion. I have tried to answer some of the > comments below. > > On 8/4/20 9:30 AM, Mark Rutland wrote: > > > >> So, the context is - if security settings in a system disallow a page to have > >> both write and execute permissions, how do you allow the execution of > >> genuine trampolines that are runtime generated and placed in a data > >> page or a stack page? > > There are options today, e.g. > > > > a) If the restriction is only per-alias, you can have distinct aliases > > where one is writable and another is executable, and you can make it > > hard to find the relationship between the two. > > > > b) If the restriction is only temporal, you can write instructions into > > an RW- buffer, transition the buffer to R--, verify the buffer > > contents, then transition it to --X. > > > > c) You can have two processes A and B where A generates instrucitons into > > a buffer that (only) B can execute (where B may be restricted from > > making syscalls like write, mprotect, etc). > > The general principle of the mitigation is W^X. I would argue that > the above options are violations of the W^X principle. If they are > allowed today, they must be fixed. And they will be. So, we cannot > rely on them. Hold on. Contemporary W^X means that a given virtual alias cannot be writable and executable simultaneously, permitting (a) and (b). If you read the references on the Wikipedia page for W^X you'll see the OpenBSD 3.3 release notes and related presentation make this clear, and further they expect (b) to occur with JITs flipping W/X with mprotect(). Please don't conflate your assumed stronger semantics with the general principle. It not matching your expectations does not necessarily mean that it is wrong. If you want stronger W^X semantics, please refer to this specifically with a distinct name. > a) This requires a remap operation.
Two mappings point to the same > physical page. One mapping has W and the other one has X. This > is a violation of W^X. > > b) This is again a violation. The kernel should refuse to give execute > permission to a page that was writeable in the past and refuse to > give write permission to a page that was executable in the past. > > c) This is just a variation of (a). As above, this is not true. If you have a rationale for why this is desirable or necessary, please justify that before using this as justification for additional features. > In general, the problem with user-level methods to map and execute > dynamic code is that the kernel cannot tell if a genuine application is > using them or an attacker is using them or piggy-backing on them. Yes, and as I pointed out the same is true for trampfd unless you can somehow authenticate that the calls are legitimate (in both callsite and the set of arguments), and I don't see any reasonable way of doing that. If you relax your threat model to an attacker not being able to make arbitrary syscalls, then your suggestion that userspace can perform checks between syscalls may be sufficient, but as I pointed out that's equally true for a sealed memfd or similar. > Off the top of my head, I have tried to identify some examples > where we can have more trust on dynamic code and have the kernel > permit its execution. > > 1. If the kernel can do the job, then that is one safe way. Here, the kernel > is the code. There is no code generation involved. This is what I > have presented in the patch series as the first cut. This is sleight-of-hand; it doesn't matter where the logic is performed if the power is identical. Practically speaking this is equivalent to some dynamic code generation. I think that it's misleading to say that because the kernel emulates something it is safe when the provenance of the syscall arguments cannot be verified. [...] > Anyway, these are just examples.
The principle is - if we can identify > dynamic code that has a certain measure of trust, can the kernel > permit their execution? My point generally is that the kernel cannot identify this, and if userspace code is trusted to dynamically generate trampfd arguments it can equally be trusted to dynamically generate code. [...] > As I have mentioned above, I intend to have the kernel generate code > only if the code generation is simple enough. For more complicated cases, > I plan to use a user-level code generator that is for exclusive kernel use. > I have yet to work out the details on how this would work. Need time. This reads to me like trampfd is only dealing with a few special cases and we know that we need a more general solution. I hope I am mistaken, but I get the strong impression that you're trying to justify your existing solution rather than trying to understand the problem space. To be clear, my strong opinion is that we should not be trying to do this sort of emulation or code generation within the kernel. I do think it's worthwhile to look at mechanisms to make it harder to subvert dynamic userspace code generation, but I think the code generation itself needs to live in userspace (e.g. for ABI reasons I previously mentioned). Mark.
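Option (a) from the list quoted in the message above (two aliases of the same memory, one writable and one executable) can be sketched as follows. This is a minimal sketch, assuming Linux with glibc 2.27 or later for memfd_create(); `alias_pair` and `make_aliases` are hypothetical names, and nothing is executed. No sealing is applied here, so it is not the "sealed memfd" variant mentioned above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical holder for the two views of the same page. */
struct alias_pair {
    void *w;     /* RW- alias: bytes are written through this one */
    void *x;     /* R-X alias: the same bytes are visible here    */
    size_t len;
};

/* Creates one page of anonymous file-backed memory and maps it
 * twice. Returns 0 on success, -1 on failure. */
int make_aliases(struct alias_pair *p)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(p != NULL);
    if (page <= 0)
        return -1;

    int fd = memfd_create("alias-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    p->len = (size_t)page;
    p->w = mmap(NULL, p->len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    p->x = mmap(NULL, p->len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd); /* the mappings keep the memory alive */

    if (p->w == MAP_FAILED || p->x == MAP_FAILED) {
        if (p->w != MAP_FAILED)
            munmap(p->w, p->len);
        if (p->x != MAP_FAILED)
            munmap(p->x, p->len);
        return -1;
    }
    return 0;
}
```

Neither mapping is ever both writable and executable, yet bytes stored through `w` become visible through `x` at a different virtual address. That is the per-alias reading of W^X described above, and it is the behavior that the stricter interpretation argued for in this thread would disallow.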
On 8/12/20 5:06 AM, Mark Rutland wrote: > [..] >> >> The general principle of the mitigation is W^X. I would argue that >> the above options are violations of the W^X principle. If they are >> allowed today, they must be fixed. And they will be. So, we cannot >> rely on them. > > Hold on. > > Contemporary W^X means that a given virtual alias cannot be writeable > and executeable simultaneously, permitting (a) and (b). If you read the > references on the Wikipedia page for W^X you'll see the OpenBSD 3.3 > release notes and related presentation make this clear, and further they > expect (b) to occur with JITS flipping W/X with mprotect(). > > Please don't conflate your assumed stronger semantics with the general > principle. It not matching you expectations does not necessarily mean > that it is wrong. > > If you want a stronger W^X semantics, please refer to this specifically > with a distinct name. OK. Fair enough. We can give a different name to the stronger requirement. Just for the sake of this discussion and for the want of a better name, let us call it WX2. > >> a) This requires a remap operation. Two mappings point to the same >> physical page. One mapping has W and the other one has X. This >> is a violation of W^X. >> >> b) This is again a violation. The kernel should refuse to give execute >> permission to a page that was writeable in the past and refuse to >> give write permission to a page that was executable in the past. >> >> c) This is just a variation of (a). > > As above, this is not true. > > If you have a rationale for why this is desirable or necessary, please > justify that before using this as justification for additional features. > I already supplied the justification. Any user level method can potentially be hijacked by an attacker for his purpose. WX does not prevent all of the methods. We need WX2. 
>> In general, the problem with user-level methods to map and execute >> dynamic code is that the kernel cannot tell if a genuine application is >> using them or an attacker is using them or piggy-backing on them. > > Yes, and as I pointed out the same is true for trampfd unless you can > somehow authenticate the calls are legitimate (in both callsite and the > set of arguments), and I don't see any reasonable way of doing that. > I am afraid I am not in agreement with this. If WX2 is not implemented, an attacker can hack both code and data. If WX2 is implemented, an attacker can only attack data. The attack surface is reduced. Also, trampfd calls coming from code from a signed file can be authenticated. trampfd calls coming from an attacker's generated code cannot be authenticated. > If you relax your threat model to an attacker not being able to make > arbitrary syscalls, then your suggestion that userspace can perorm > chceks between syscalls may be sufficient, but as I pointed out that's > equally true for a sealed memfd or similar. > Actually, I did not suggest that userspace can perform checks. I said that the kernel can perform checks. User space cannot reliably perform checks between calls. A clever hacker can cover his tracks. In any case, the kernel has no knowledge of these checks. So, when execute permissions are requested for a page, a properly implemented WX2 can refuse. >> Off the top of my head, I have tried to identify some examples >> where we can have more trust on dynamic code and have the kernel >> permit its execution. >> >> 1. If the kernel can do the job, then that is one safe way. Here, the kernel >> is the code. There is no code generation involved. This is what I >> have presented in the patch series as the first cut. > > This is sleight-of-hand; it doesn't matter where the logic is performed > if the power is identical. Practically speaking this is equivalent to > some dynamic code generation. 
> > I think that it's misleading to say that because the kernel emulates > something it is safe when the provenance of the syscall arguments cannot > be verified. I submit that there are two aspects - code and data. In one case, both code and data can be hacked. So, an attacker can modify both code and data. In the other case, the attacker can only modify data. The power is not identical. The attack surface is not the same. Most of the times, security measures are mitigations. They are not a 100%. This approach of not allowing the user to do certain things that can be exploited and having the kernel doing them increases our confidence. From that perspective, the two approaches are different and it is worth pursuing a kernel based mitigation. > > [...] > >> Anyway, these are just examples. The principle is - if we can identify >> dynamic code that has a certain measure of trust, can the kernel >> permit their execution? > > My point generally is that the kernel cannot identify this, and if > usrspace code is trusted to dynamically generate trampfd arguments it > can equally be trusted to dyncamilly generate code. I am afraid not. See my previous response. Ability to hack only data gives an attacker fewer options as compared to the ability to hack both code and data. > > [...] > >> As I have mentioned above, I intend to have the kernel generate code >> only if the code generation is simple enough. For more complicated cases, >> I plan to use a user-level code generator that is for exclusive kernel use. >> I have yet to work out the details on how this would work. Need time. > > This reads to me like trampfd is only dealing with a few special cases > and we know that we need a more general solution. > > I hope I am mistaken, but I get the strong impression that you're trying > to justify your existing solution rather than trying to understand the > problem space. > I do understand the problem space. 
I wanted to address dynamic code in 3 different ways in separate phases starting from the easiest and working my way up to the more difficult ones. 1. Remove dynamic code where possible If the kernel can replace user level dynamic code, then do it. This is what I did in version 1. 2. Replace dynamic code with static code Where you cannot do (1), replace dynamic code with static code with the kernel's help. I wanted to do this later. But I have decided to do this in version 2. This combined with signature verification of files adds a measure of trust in the code. 3. Deal with JIT, DBT, etc In (1) and (2), we deal with machine code. In (3), there is some source from which dynamic code needs to be generated using a code generator. E.g., JIT code from Java byte code. Here, the solution I had in mind had two parts: - Make the source more trustworthy by requiring it to be part of a signed file - Design a code generator trusted and used exclusively by the kernel In this patchset, I wanted to lay a foundation for all 3 and attempt to solve (1) first. Once this was in place, I wanted to do (2) and then (3). In retrospect, I should have probably started with the big picture first instead of starting with just item (1). But I always had the big picture in mind. That said, I did not necessarily have all the details fleshed out for all the phases. (3) is complex. My focus was to define the API in a generic enough fashion so that all 3 phases can be implemented. But I realize that it is a hard sell at this point to convince people that the API is adequate for phase 3. So, I have decided to do (1) and (2). (3) has to be done separately with more thought and details put into it. Also, it may be the case that there are some examples of dynamic code out there that can never be addressed. My goal is to try to address a majority of the dynamic code out there.
> To be clear, my strong opinion is that we should not be trying to do > this sort of emulation or code generation within the kernel. I do think > it's worthwhile to look at mechanisms to make it harder to subvert > dynamic userspace code generation, but I think the code generation > itself needs to live in userspace (e.g. for ABI reasons I previously > mentioned). > I completely agree that the kernel should not deal with the complexities of code generation and ABI details. My version 1 did not have any code generation. But since a performance issue was raised, I explored the idea of kernel code generation. To be honest, I was not really that comfortable with the idea. That is why I have decided to implement the second piece I had in my plan now. This piece does not have the code generation complexities or ABI issues. This piece can be used to solve libffi, GCC, etc. I will still write the code in such a way that I can use the first approach in the future if I really need it. But it will not involve any code generation from the kernel. It will only be used for cases that don't mind the extra trip to the kernel. Madhavan
On 12/08/2020 12:06, Mark Rutland wrote: > On Thu, Aug 06, 2020 at 12:26:02PM -0500, Madhavan T. Venkataraman wrote: >> Thanks for the lively discussion. I have tried to answer some of the >> comments below. >> >> On 8/4/20 9:30 AM, Mark Rutland wrote: >>> >>>> So, the context is - if security settings in a system disallow a page to have >>>> both write and execute permissions, how do you allow the execution of >>>> genuine trampolines that are runtime generated and placed in a data >>>> page or a stack page? >>> There are options today, e.g. >>> >>> a) If the restriction is only per-alias, you can have distinct aliases >>> where one is writable and another is executable, and you can make it >>> hard to find the relationship between the two. >>> >>> b) If the restriction is only temporal, you can write instructions into >>> an RW- buffer, transition the buffer to R--, verify the buffer >>> contents, then transition it to --X. >>> >>> c) You can have two processes A and B where A generates instrucitons into >>> a buffer that (only) B can execute (where B may be restricted from >>> making syscalls like write, mprotect, etc). >> >> The general principle of the mitigation is W^X. I would argue that >> the above options are violations of the W^X principle. If they are >> allowed today, they must be fixed. And they will be. So, we cannot >> rely on them. > > Hold on. > > Contemporary W^X means that a given virtual alias cannot be writeable > and executeable simultaneously, permitting (a) and (b). If you read the > references on the Wikipedia page for W^X you'll see the OpenBSD 3.3 > release notes and related presentation make this clear, and further they > expect (b) to occur with JITS flipping W/X with mprotect(). W^X (with "permanent" mprotect restrictions [1]) goes back to 2000 with PaX [2] (which predates partial OpenBSD implementation from 2003). 
[1] https://pax.grsecurity.net/docs/mprotect.txt [2] https://undeadly.org/cgi?action=article;sid=20030417082752 > > Please don't conflate your assumed stronger semantics with the general > principle. It not matching you expectations does not necessarily mean > that it is wrong. > > If you want a stronger W^X semantics, please refer to this specifically > with a distinct name. > >> a) This requires a remap operation. Two mappings point to the same >> physical page. One mapping has W and the other one has X. This >> is a violation of W^X. >> >> b) This is again a violation. The kernel should refuse to give execute >> permission to a page that was writeable in the past and refuse to >> give write permission to a page that was executable in the past. >> >> c) This is just a variation of (a). > > As above, this is not true. > > If you have a rationale for why this is desirable or necessary, please > justify that before using this as justification for additional features. > >> In general, the problem with user-level methods to map and execute >> dynamic code is that the kernel cannot tell if a genuine application is >> using them or an attacker is using them or piggy-backing on them. > > Yes, and as I pointed out the same is true for trampfd unless you can > somehow authenticate the calls are legitimate (in both callsite and the > set of arguments), and I don't see any reasonable way of doing that. > > If you relax your threat model to an attacker not being able to make > arbitrary syscalls, then your suggestion that userspace can perorm > chceks between syscalls may be sufficient, but as I pointed out that's > equally true for a sealed memfd or similar. > >> Off the top of my head, I have tried to identify some examples >> where we can have more trust on dynamic code and have the kernel >> permit its execution. >> >> 1. If the kernel can do the job, then that is one safe way. Here, the kernel >> is the code. There is no code generation involved. 
This is what I >> have presented in the patch series as the first cut. > > This is sleight-of-hand; it doesn't matter where the logic is performed > if the power is identical. Practically speaking this is equivalent to > some dynamic code generation. > > I think that it's misleading to say that because the kernel emulates > something it is safe when the provenance of the syscall arguments cannot > be verified. > > [...] > >> Anyway, these are just examples. The principle is - if we can identify >> dynamic code that has a certain measure of trust, can the kernel >> permit their execution? > > My point generally is that the kernel cannot identify this, and if > usrspace code is trusted to dynamically generate trampfd arguments it > can equally be trusted to dyncamilly generate code. > > [...] > >> As I have mentioned above, I intend to have the kernel generate code >> only if the code generation is simple enough. For more complicated cases, >> I plan to use a user-level code generator that is for exclusive kernel use. >> I have yet to work out the details on how this would work. Need time. > > This reads to me like trampfd is only dealing with a few special cases > and we know that we need a more general solution. > > I hope I am mistaken, but I get the strong impression that you're trying > to justify your existing solution rather than trying to understand the > problem space. > > To be clear, my strong opinion is that we should not be trying to do > this sort of emulation or code generation within the kernel. I do think > it's worthwhile to look at mechanisms to make it harder to subvert > dynamic userspace code generation, but I think the code generation > itself needs to live in userspace (e.g. for ABI reasons I previously > mentioned). > > Mark. >
On Wed, Aug 19, 2020 at 08:53:42PM +0200, Mickaël Salaün wrote: > On 12/08/2020 12:06, Mark Rutland wrote: > > Contemporary W^X means that a given virtual alias cannot be writeable > > and executeable simultaneously, permitting (a) and (b). If you read the > > references on the Wikipedia page for W^X you'll see the OpenBSD 3.3 > > release notes and related presentation make this clear, and further they > > expect (b) to occur with JITS flipping W/X with mprotect(). > > W^X (with "permanent" mprotect restrictions [1]) goes back to 2000 with > PaX [2] (which predates partial OpenBSD implementation from 2003). > > [1] https://pax.grsecurity.net/docs/mprotect.txt > [2] https://undeadly.org/cgi?action=article;sid=20030417082752 Thanks for the pointers! Mark.
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduction
------------

Trampolines are used in many different user applications. Trampoline
code is often generated at runtime. Trampoline code can also just be a
pre-defined sequence of machine instructions in a data buffer.

Trampoline code is placed either in a data page or in a stack page. In
order to execute a trampoline, the page it resides in needs to be mapped
with execute permissions. Writable pages with execute permissions provide
an attack surface for hackers. Attackers can use this to inject malicious
code, modify existing code or do other harm.

To mitigate this, LSMs such as SELinux may not allow pages to have both
write and execute permissions. This prevents trampolines from executing
and blocks applications that use trampolines. To allow genuine applications
to run, exceptions have to be made for them (by setting execmem, etc).
In this case, the attack surface is just the pages of such applications.

An application that is not allowed to have writable executable pages
may try to load trampoline code into a file and map the file with execute
permissions. In this case, the attack surface is just the buffer that
contains trampoline code. However, a successful exploit may provide the
hacker with means to load his own code in a file, map it and execute it.

LSMs (such as the IPE proposal [1]) may allow only properly signed object
files to be mapped with execute permissions. This will prevent trampoline
files from being mapped. Again, exceptions have to be made for genuine
applications.

We need a way to execute trampolines without making security exceptions
where possible and to reduce the attack surface even further.

Examples of trampolines
-----------------------

libffi (A Portable Foreign Function Interface Library):

libffi allows a user to define functions with an arbitrary list of
arguments and return value through a feature called "Closures".
Closures use trampolines to jump to ABI handlers that handle calling
conventions and call a target function. libffi is used by a lot
of different applications. To name a few:

    - Python
    - Java
    - JavaScript
    - Ruby FFI
    - Lisp
    - Objective-C

GCC nested functions:

GCC has traditionally used trampolines for implementing nested
functions. The trampoline is placed on the user stack. So, the stack
needs to be executable.

Currently available solution
----------------------------

One solution that has been proposed to allow trampolines to be executed
without making security exceptions is Trampoline Emulation. See:

https://pax.grsecurity.net/docs/emutramp.txt

In this solution, the kernel recognizes certain sequences of instructions
as "well-known" trampolines. When such a trampoline is executed, a page
fault happens because the trampoline page does not have execute permission.
The kernel recognizes the trampoline and emulates it. Basically, the
kernel does the work of the trampoline on behalf of the application.

Here, the attack surface is the buffer that contains the trampoline.
The attack surface is narrower than before. A hacker may still be able to
modify what gets loaded in the registers or modify the target PC to point
to arbitrary locations.

Currently, the emulated trampolines are the ones used in libffi and GCC
nested functions. To my knowledge, only X86 is supported at this time.

As noted in emutramp.txt, this is not a generic solution. For every new
trampoline that needs to be supported, new instruction sequences need to
be recognized by the kernel and emulated. And this has to be done for
every architecture that needs to be supported.

emutramp.txt notes the following:

"... the real solution is not in emulation but by designing a kernel API
for runtime code generation and modifying userland to make use of it."
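
To make the "recognize a well-known instruction sequence" step concrete,
here is a minimal user-space sketch in C. It is not the PaX code. The
byte pattern is an assumption: it is the commonly seen x86-64 encoding of
a GCC nested-function trampoline (two movabs instructions followed by an
indirect jump). The names tramp_match, read_le64 and match_gcc_tramp are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: what an emulator would extract from the bytes. */
struct tramp_match {
	uint64_t target_pc;	/* immediate loaded into %r11, then jumped to */
	uint64_t static_chain;	/* immediate loaded into %r10 */
};

/* Decode a little-endian 64-bit immediate, independent of host endianness. */
uint64_t read_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Assumed byte layout (x86-64 GCC nested-function trampoline):
 *   49 bb <imm64>   movabs $target, %r11
 *   49 ba <imm64>   movabs $chain,  %r10
 *   49 ff e3        jmp    *%r11
 * Returns 1 and fills *out on a match, 0 otherwise.
 */
int match_gcc_tramp(const uint8_t *buf, size_t len, struct tramp_match *out)
{
	assert(buf != NULL && out != NULL);

	if (len < 23)
		return 0;
	if (buf[0] != 0x49 || buf[1] != 0xbb)
		return 0;
	if (buf[10] != 0x49 || buf[11] != 0xba)
		return 0;
	if (buf[20] != 0x49 || buf[21] != 0xff || buf[22] != 0xe3)
		return 0;

	out->target_pc = read_le64(buf + 2);
	out->static_chain = read_le64(buf + 12);
	return 1;
}
```

An emulator along these lines would run the matcher on the bytes at the
faulting PC and, on a match, load the two values into the saved user
registers and resume at target_pc. The sketch also shows why the approach
does not generalize: every additional trampoline shape needs its own
matcher, on every architecture.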
Trampoline File Descriptor (trampfd)
------------------------------------

I am proposing a kernel API using anonymous file descriptors that
can be used to create and execute trampolines with the help of the
kernel. In this solution also, the kernel does the work of the trampoline.
The API is described in patch 1/4 of this patchset. I provide a
summary here:

Trampolines commonly execute the following sequence:

    - Load some values in some registers and/or
    - Push some values on the stack
    - Jump to a target PC

libffi and GCC nested function trampolines fit into this model.

Using the kernel API, applications and libraries can:

    - Create a trampoline object
    - Associate a register context with the trampoline (including
      a target PC)
    - Associate a stack context with the trampoline
    - Map the trampoline into a process address space
    - Execute the trampoline by executing at the trampoline address

The kernel creates the trampoline mapping without any permissions. When
the trampoline is executed by user code, a page fault happens and the
kernel gets control. The kernel recognizes that this is a trampoline
invocation. It sets up the user registers based on the specified
register context, and/or pushes values on the user stack based on the
specified stack context, and sets the user PC to the requested target
PC. When the kernel returns, execution continues at the target PC.
So, the kernel does the work of the trampoline on behalf of the
application.

In this case, the attack surface is the context buffer. A hacker may
attack an application with a vulnerability and may be able to modify the
context buffer. So, when the register or stack context is set for
a trampoline, the values may have been tampered with. From an attack
surface perspective, this is similar to Trampoline Emulation. But
with trampfd, user code can retrieve a trampoline's context from the
kernel and add defensive checks to see if the context has been
tampered with.
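
The model described above (register context, stack context, target PC,
plus a read-back check) can be sketched as a small user-space simulation
in C. This is only an illustration of the described behavior, not the
trampfd API from patch 1/4: every name here (sim_tramp, sim_cpu,
sim_tramp_invoke, sim_tramp_ctx_matches, the SIM_* constants) is
hypothetical, and the "CPU" is a plain struct standing in for the saved
user registers and user stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_MAX_REGS	4
#define SIM_MAX_STACK	4
#define SIM_STACK_WORDS	16

/* Hypothetical register names; real ones would come from the uapi headers. */
enum sim_reg { SIM_R10, SIM_R11, SIM_NREGS };

struct sim_reg_val {
	enum sim_reg reg;
	uint64_t value;
};

/* Model of a trampoline object: register context, stack context, target PC. */
struct sim_tramp {
	size_t nregs;
	struct sim_reg_val regs[SIM_MAX_REGS];
	size_t nstack;
	uint64_t stack[SIM_MAX_STACK];
	uint64_t target_pc;
};

/* Model of the user state that would be modified on the fault. */
struct sim_cpu {
	uint64_t regs[SIM_NREGS];
	uint64_t pc;
	uint64_t stack[SIM_STACK_WORDS];
	size_t sp;	/* index of the top of stack; the stack grows down */
};

void sim_cpu_init(struct sim_cpu *cpu)
{
	memset(cpu, 0, sizeof(*cpu));
	cpu->sp = SIM_STACK_WORDS;	/* empty stack */
}

/*
 * Mimic the described fault path: load registers, push stack values,
 * set the PC. Everything is validated first, so a malformed context
 * changes nothing. Returns 0 on success, -1 on a malformed context.
 */
int sim_tramp_invoke(const struct sim_tramp *t, struct sim_cpu *cpu)
{
	assert(t != NULL && cpu != NULL);

	if (t->nregs > SIM_MAX_REGS || t->nstack > SIM_MAX_STACK)
		return -1;
	if (t->nstack > cpu->sp)
		return -1;	/* not enough room on the simulated stack */
	for (size_t i = 0; i < t->nregs; i++)
		if ((unsigned)t->regs[i].reg >= SIM_NREGS)
			return -1;

	for (size_t i = 0; i < t->nregs; i++)
		cpu->regs[t->regs[i].reg] = t->regs[i].value;
	for (size_t i = 0; i < t->nstack; i++)
		cpu->stack[--cpu->sp] = t->stack[i];
	cpu->pc = t->target_pc;
	return 0;
}

/*
 * Defensive check: compare a context that was read back (in the proposal,
 * from the kernel) with the values the caller expects. Returns 1 on match.
 */
int sim_tramp_ctx_matches(const struct sim_tramp *got,
			  const struct sim_tramp *want)
{
	if (got->nregs != want->nregs || got->nstack != want->nstack ||
	    got->target_pc != want->target_pc)
		return 0;
	if (got->nregs > SIM_MAX_REGS || got->nstack > SIM_MAX_STACK)
		return 0;
	for (size_t i = 0; i < got->nregs; i++)
		if (got->regs[i].reg != want->regs[i].reg ||
		    got->regs[i].value != want->regs[i].value)
			return 0;
	for (size_t i = 0; i < got->nstack; i++)
		if (got->stack[i] != want->stack[i])
			return 0;
	return 1;
}
```

In this sketch the check is only as strong as the "want" copy: it helps
if the expected values can be recomputed or are kept somewhere an
attacker cannot write, which is an assumption of the illustration rather
than something the cover letter specifies.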
As for the target PC, trampfd implements a measure called the
"Allowed PCs" context (see Advantages) to prevent a hacker from making
the target PC point to arbitrary locations. So, the attack surface is
narrower than Trampoline Emulation.

Advantages of the Trampoline File Descriptor approach
-----------------------------------------------------

- trampfd is customizable. The user can specify any combination of
  allowed register name-value pairs in the register context and the kernel
  will set it up accordingly. This allows different user trampolines to be
  converted to use trampfd.

- trampfd allows a stack context to be set up so that trampolines that
  need to push values on the user stack can do that.

- The initial work is targeted for X86 and ARM. But the implementation
  leverages small portions of existing signal delivery code. Specifically,
  it uses pt_regs for setting up user registers and copy_to_user()
  to push values on the stack. So, this can be very easily ported to other
  architectures.

- trampfd provides a basic framework. In the future, new trampoline types
  can be implemented, new contexts can be defined, and additional rules
  can be implemented for security purposes.

- For instance, trampfd defines an "Allowed PCs" context in this initial
  work. As an example, libffi can create a read-only array of all ABI
  handlers for an architecture at build time. This array can be used to
  set the list of allowed PCs for a trampoline. This will mean that a hacker
  cannot hack the PC part of the register context and make it point to
  arbitrary locations.

- An SELinux setting called "exectramp" can be implemented along the
  lines of "execmem", "execstack" and "execheap" to selectively allow the
  use of trampolines on a per application basis.

- User code can add defensive checks in the code before invoking a
  trampoline to make sure that a hacker has not modified the context data.
  It can do this by getting the trampoline context from the kernel and
  double checking it.
- In the future, if the kernel can be enhanced to use a safe code
  generation component, that code can be placed in the trampoline mapping
  pages. Then, the trampoline invocation does not have to incur a trip
  into the kernel.

- Also, if the kernel can be enhanced to use a safe code generation
  component, other forms of dynamic code such as JIT code can be
  addressed by the trampfd framework.

- Trampolines can be shared across processes which can give rise to
  interesting uses in the future.

- Trampfd can be used for other purposes to extend the kernel's
  functionality.

libffi
------

I have implemented my solution for libffi and provided the changes for
X86 and ARM, 32-bit and 64-bit. Here is the reference patch:

http://linux.microsoft.com/~madvenka/libffi/libffi.txt

If the trampfd patchset gets accepted, I will send the libffi changes
to the maintainers for a review. BTW, I have also successfully executed
the libffi self tests.

Work that is pending
--------------------

- I am working on implementing an SELinux setting called "exectramp"
  similar to "execmem" to allow the use of trampfd on a per application
  basis.

- I have a comprehensive test program to test the kernel API. I am
  working on adding it to selftests.

References
----------

[1] https://microsoft.github.io/ipe/

---
Madhavan T. Venkataraman (4):
  fs/trampfd: Implement the trampoline file descriptor API
  x86/trampfd: Support for the trampoline file descriptor
  arm64/trampfd: Support for the trampoline file descriptor
  arm/trampfd: Support for the trampoline file descriptor

 arch/arm/include/uapi/asm/ptrace.h     |  20 ++
 arch/arm/kernel/Makefile               |   1 +
 arch/arm/kernel/trampfd.c              | 214 +++++++++++++++++
 arch/arm/mm/fault.c                    |  12 +-
 arch/arm/tools/syscall.tbl             |   1 +
 arch/arm64/include/asm/ptrace.h        |   9 +
 arch/arm64/include/asm/unistd.h        |   2 +-
 arch/arm64/include/asm/unistd32.h      |   2 +
 arch/arm64/include/uapi/asm/ptrace.h   |  57 +++++
 arch/arm64/kernel/Makefile             |   2 +
 arch/arm64/kernel/trampfd.c            | 278 ++++++++++++++++++++++
 arch/arm64/mm/fault.c                  |  15 +-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/uapi/asm/ptrace.h     |  38 +++
 arch/x86/kernel/Makefile               |   2 +
 arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
 arch/x86/mm/fault.c                    |  11 +
 fs/Makefile                            |   1 +
 fs/trampfd/Makefile                    |   6 +
 fs/trampfd/trampfd_data.c              |  43 ++++
 fs/trampfd/trampfd_fops.c              | 131 +++++++++++
 fs/trampfd/trampfd_map.c               |  78 ++++++
 fs/trampfd/trampfd_pcs.c               |  95 ++++++++
 fs/trampfd/trampfd_regs.c              | 137 +++++++++++
 fs/trampfd/trampfd_stack.c             | 131 +++++++++++
 fs/trampfd/trampfd_stubs.c             |  41 ++++
 fs/trampfd/trampfd_syscall.c           |  92 ++++++++
 include/linux/syscalls.h               |   3 +
 include/linux/trampfd.h                |  82 +++++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/trampfd.h           | 171 ++++++++++++++
 init/Kconfig                           |   8 +
 kernel/sys_ni.c                        |   3 +
 34 files changed, 1998 insertions(+), 7 deletions(-)
 create mode 100644 arch/arm/kernel/trampfd.c
 create mode 100644 arch/arm64/kernel/trampfd.c
 create mode 100644 arch/x86/kernel/trampfd.c
 create mode 100644 fs/trampfd/Makefile
 create mode 100644 fs/trampfd/trampfd_data.c
 create mode 100644 fs/trampfd/trampfd_fops.c
 create mode 100644 fs/trampfd/trampfd_map.c
 create mode 100644 fs/trampfd/trampfd_pcs.c
 create mode 100644 fs/trampfd/trampfd_regs.c
 create mode 100644 fs/trampfd/trampfd_stack.c
 create mode 100644 fs/trampfd/trampfd_stubs.c
 create mode 100644 fs/trampfd/trampfd_syscall.c
 create mode 100644 include/linux/trampfd.h
 create mode 100644 include/uapi/linux/trampfd.h