Message ID | 20200916150826.5990-1-madvenka@linux.microsoft.com (mailing list archive) |
---|---|
Series | Implement Trampoline File Descriptor |
* madvenka:

> Examples of trampolines
> =======================
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
> - Python
> - Java
> - Javascript
> - Ruby FFI
> - Lisp
> - Objective C

libffi does not actually need this.  It currently collocates
trampolines and the data they need on the same page, but that's
actually unnecessary.  It's possible to avoid doing this just by
changing libffi, without any kernel changes.

I think this has already been done for the iOS port.

> The code for trampoline X in the trampoline table is:
>
>     load    &code_table[X], code_reg
>     load    (code_reg), code_reg
>     load    &data_table[X], data_reg
>     load    (data_reg), data_reg
>     jump    code_reg
>
> The addresses &code_table[X] and &data_table[X] are baked into the
> trampoline code. So, PC-relative data references are not needed. The user
> can modify code_table[X] and data_table[X] dynamically.

You can put this code into the libffi shared object and map it from
there, just like the rest of the libffi code.  To get more
trampolines, you can map the page containing the trampolines multiple
times, each instance preceded by a separate data page with the control
information.

I think the previous patch submission has also resulted in several
comments along those lines, so I'm not sure why you are reposting
this.

> libffi
> ======
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt

The URL does not appear to work, I get a 403 error.

> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.

I have not seen your libffi changes, but I expect that the complexity
is about the same as a userspace-only solution.

Cc:ing libffi upstream for awareness.  The start of the thread is
here:

<https://lore.kernel.org/linux-api/20200916150826.5990-1-madvenka@linux.microsoft.com/>
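The mapping scheme suggested above (one shared trampoline page mapped several times, each copy preceded by its own data page) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not libffi code: the names `map_instance`, `data_for` and `demo_two_instances` are made up here, a temporary file stands in for the libffi shared object, and the "code" page is mapped read-only and never executed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* One instance: a private writable data page directly followed by a
 * read-only view of the shared "trampoline" page (hypothetical layout). */
struct instance {
    unsigned char *data;  /* per-instance control information */
    unsigned char *code;  /* another view of the same file page */
};

/* Map a fresh data page plus one more view of the file's first page.
 * Returns 0 on success, -1 on failure. */
static int map_instance(int fd, size_t page, struct instance *out)
{
    unsigned char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    /* Overlay the second page with the file contents. Real trampolines
     * would need PROT_READ | PROT_EXEC; this sketch never executes it. */
    void *code = mmap(base + page, page, PROT_READ,
                      MAP_PRIVATE | MAP_FIXED, fd, 0);
    if (code == MAP_FAILED) {
        munmap(base, 2 * page);
        return -1;
    }
    out->data = base;
    out->code = code;
    return 0;
}

/* The data belonging to a trampoline sits exactly one page below it. */
static unsigned char *data_for(unsigned char *tramp, size_t page)
{
    return tramp - page;
}

/* Returns 0 if two instances share code bytes but keep separate data. */
static int demo_two_instances(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(page > 0);

    FILE *f = tmpfile();            /* stand-in for the shared object */
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < page; i++)
        fputc((int)(i & 0xff), f);  /* placeholder bytes, not real code */
    if (fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    struct instance a, b;
    if (map_instance(fileno(f), page, &a) != 0) {
        fclose(f);
        return -1;
    }
    if (map_instance(fileno(f), page, &b) != 0) {
        munmap(a.data, 2 * page);
        fclose(f);
        return -1;
    }

    a.data[16] = 1;                 /* control info for slot 1 of copy a */
    b.data[16] = 2;                 /* control info for slot 1 of copy b */

    int ok = a.code != b.code
          && memcmp(a.code, b.code, page) == 0
          && *data_for(a.code + 16, page) == 1
          && *data_for(b.code + 16, page) == 2;

    munmap(a.data, 2 * page);
    munmap(b.data, 2 * page);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling `demo_two_instances()` from a `main` is expected to return 0 on a Linux system where `tmpfile()` and `mmap()` are available. The point of the layout is that the code page never has to be written, so it can stay file-backed and shared.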
On 9/16/20 8:04 PM, Florian Weimer wrote:
> * madvenka:
>
>> Examples of trampolines
>> =======================
>>
>> libffi (A Portable Foreign Function Interface Library):
>>
>> libffi allows a user to define functions with an arbitrary list of
>> arguments and return value through a feature called "Closures".
>> Closures use trampolines to jump to ABI handlers that handle calling
>> conventions and call a target function. libffi is used by a lot
>> of different applications. To name a few:
>>
>> - Python
>> - Java
>> - Javascript
>> - Ruby FFI
>> - Lisp
>> - Objective C
>
> libffi does not actually need this.  It currently collocates
> trampolines and the data they need on the same page, but that's
> actually unnecessary.  It's possible to avoid doing this just by
> changing libffi, without any kernel changes.
>
> I think this has already been done for the iOS port.
>

The trampoline table that has been implemented for the iOS port (MACH)
is based on PC-relative data referencing. That is, the code and data
are placed in adjacent pages so that the code can access the data using
an address relative to the current PC.

This is an ISA feature that is not supported on all architectures.

Now, if it is a performance feature, we can include some architectures
and exclude others. But this is a security feature. IMO, we cannot
exclude any architecture even if it is a legacy one as long as Linux
is running on the architecture. So, we need a solution that does
not assume any specific ISA feature.

>> The code for trampoline X in the trampoline table is:
>>
>>     load    &code_table[X], code_reg
>>     load    (code_reg), code_reg
>>     load    &data_table[X], data_reg
>>     load    (data_reg), data_reg
>>     jump    code_reg
>>
>> The addresses &code_table[X] and &data_table[X] are baked into the
>> trampoline code. So, PC-relative data references are not needed. The user
>> can modify code_table[X] and data_table[X] dynamically.
>
> You can put this code into the libffi shared object and map it from
> there, just like the rest of the libffi code.  To get more
> trampolines, you can map the page containing the trampolines multiple
> times, each instance preceded by a separate data page with the control
> information.
>

If you put the code in the libffi shared object, how do you pass data to
the code at runtime? If the code we are talking about is a function, then
there is an ABI defined way to pass data to the function. But if the
code we are talking about is some arbitrary code such as a trampoline,
there is no ABI defined way to pass data to it except in a couple of
platforms such as HP PA-RISC that have support for function descriptors
in the ABI itself.

As mentioned before, if the ISA supports PC-relative data references
(e.g., X86 64-bit platforms support RIP-relative data references)
then we can pass data to that code by placing the code and data in
adjacent pages. So, you can implement the trampoline table for X64.
i386 does not support it.

> I think the previous patch submission has also resulted in several
> comments along those lines, so I'm not sure why you are reposting
> this.

IIRC, I have answered all of those comments by mentioning the point
that we need to support all architectures without requiring special
ISA features. Taking the kernel's help in this is one solution.

>
>> libffi
>> ======
>>
>> I have implemented my solution for libffi and provided the changes for
>> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>>
>> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt
>
> The URL does not appear to work, I get a 403 error.

I apologize for that. That site is supposed to be accessible publicly.
I will contact the administrator and get this resolved.

Sorry for the annoyance.

>
>> If the trampfd patchset gets accepted, I will send the libffi changes
>> to the maintainers for a review. BTW, I have also successfully executed
>> the libffi self tests.
>
> I have not seen your libffi changes, but I expect that the complexity
> is about the same as a userspace-only solution.
>
>

I agree. The complexity is about the same. But the support is for all
architectures. Once the common code is in place, the changes for each
architecture are trivial.

Madhavan

> Cc:ing libffi upstream for awareness.  The start of the thread is
> here:
>
> <https://lore.kernel.org/linux-api/20200916150826.5990-1-madvenka@linux.microsoft.com/>
>
On 9/17/20 10:36 AM, Madhavan T. Venkataraman wrote:
>>> libffi
>>> ======
>>>
>>> I have implemented my solution for libffi and provided the changes for
>>> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>>>
>>> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt
>> The URL does not appear to work, I get a 403 error.
> I apologize for that. That site is supposed to be accessible publicly.
> I will contact the administrator and get this resolved.
>
> Sorry for the annoyance.
>

Could you try the link again and confirm that you can access it?
Again, sorry for the trouble.

Madhavan
* Madhavan T. Venkataraman:

> On 9/17/20 10:36 AM, Madhavan T. Venkataraman wrote:
>>>> libffi
>>>> ======
>>>>
>>>> I have implemented my solution for libffi and provided the changes for
>>>> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>>>>
>>>> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt
>>> The URL does not appear to work, I get a 403 error.
>> I apologize for that. That site is supposed to be accessible publicly.
>> I will contact the administrator and get this resolved.
>>
>> Sorry for the annoyance.
> Could you try the link again and confirm that you can access it?
> Again, sorry for the trouble.

Yes, it works now.  Thanks for having it fixed.
On Thu, Sep 17, 2020 at 10:36:02AM -0500, Madhavan T. Venkataraman wrote:
>
>
> On 9/16/20 8:04 PM, Florian Weimer wrote:
> > * madvenka:
> >
> >> Examples of trampolines
> >> =======================
> >>
> >> libffi (A Portable Foreign Function Interface Library):
> >>
> >> libffi allows a user to define functions with an arbitrary list of
> >> arguments and return value through a feature called "Closures".
> >> Closures use trampolines to jump to ABI handlers that handle calling
> >> conventions and call a target function. libffi is used by a lot
> >> of different applications. To name a few:
> >>
> >> - Python
> >> - Java
> >> - Javascript
> >> - Ruby FFI
> >> - Lisp
> >> - Objective C
> >
> > libffi does not actually need this.  It currently collocates
> > trampolines and the data they need on the same page, but that's
> > actually unnecessary.  It's possible to avoid doing this just by
> > changing libffi, without any kernel changes.
> >
> > I think this has already been done for the iOS port.
> >
>
> The trampoline table that has been implemented for the iOS port (MACH)
> is based on PC-relative data referencing. That is, the code and data
> are placed in adjacent pages so that the code can access the data using
> an address relative to the current PC.
>
> This is an ISA feature that is not supported on all architectures.
>
> Now, if it is a performance feature, we can include some architectures
> and exclude others. But this is a security feature. IMO, we cannot
> exclude any architecture even if it is a legacy one as long as Linux
> is running on the architecture. So, we need a solution that does
> not assume any specific ISA feature.

Which ISA does not support PIC objects? You mentioned i386 below, but
i386 does support them, it just needs to copy the PC into a GPR first
(see below).

>
> >> The code for trampoline X in the trampoline table is:
> >>
> >>     load    &code_table[X], code_reg
> >>     load    (code_reg), code_reg
> >>     load    &data_table[X], data_reg
> >>     load    (data_reg), data_reg
> >>     jump    code_reg
> >>
> >> The addresses &code_table[X] and &data_table[X] are baked into the
> >> trampoline code. So, PC-relative data references are not needed. The user
> >> can modify code_table[X] and data_table[X] dynamically.
> >
> > You can put this code into the libffi shared object and map it from
> > there, just like the rest of the libffi code.  To get more
> > trampolines, you can map the page containing the trampolines multiple
> > times, each instance preceded by a separate data page with the control
> > information.
> >
>
> If you put the code in the libffi shared object, how do you pass data to
> the code at runtime? If the code we are talking about is a function, then
> there is an ABI defined way to pass data to the function. But if the
> code we are talking about is some arbitrary code such as a trampoline,
> there is no ABI defined way to pass data to it except in a couple of
> platforms such as HP PA-RISC that have support for function descriptors
> in the ABI itself.
>
> As mentioned before, if the ISA supports PC-relative data references
> (e.g., X86 64-bit platforms support RIP-relative data references)
> then we can pass data to that code by placing the code and data in
> adjacent pages. So, you can implement the trampoline table for X64.
> i386 does not support it.
>

i386 just needs a tiny bit of code to copy the PC into a GPR first, i.e.
the trampoline would be:

        call    1f
1:      pop     %data_reg
        movl    (code_table + X - 1b)(%data_reg), %code_reg
        movl    (data_table + X - 1b)(%data_reg), %data_reg
        jmp     *(%code_reg)

I do not understand the point about passing data at runtime. This
trampoline is to achieve exactly that, no?

Thanks.
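The idea behind the i386 sequence above (obtain the run-time PC, then add a link-time constant to reach the tables) can be modelled in portable C. The following is a minimal sketch with made-up names (`struct image`, `load_rel`); it only imitates the address arithmetic, using a table index where the assembly uses a byte offset X, and does not execute any trampoline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS 4

/* Stand-in for one mapped copy of the trampoline image. The start of the
 * struct plays the role of the local label "1:", and the tables sit at
 * fixed distances from it, as a static linker would arrange. */
struct image {
    unsigned char anchor[16];
    uint32_t code_table[NSLOTS];
    uint32_t data_table[NSLOTS];
};

/* Link-time constants, analogous to (code_table - 1b) and
 * (data_table - 1b) in the assembly. */
#define CODE_TABLE_OFF offsetof(struct image, code_table)
#define DATA_TABLE_OFF offsetof(struct image, data_table)

/* What the trampoline does once it knows its own address: add a constant
 * offset and load. No absolute address is baked into the code. */
static uint32_t load_rel(const unsigned char *pc, size_t table_off, size_t x)
{
    uint32_t value;
    assert(x < NSLOTS);
    memcpy(&value, pc + table_off + x * sizeof value, sizeof value);
    return value;
}
```

If two `struct image` objects live at different addresses, `load_rel((const unsigned char *)&img, CODE_TABLE_OFF, x)` returns each object's own entry, which is the property that lets the same code be mapped at several addresses.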
> As mentioned before, if the ISA supports PC-relative data references
> (e.g., X86 64-bit platforms support RIP-relative data references)
> then we can pass data to that code by placing the code and data in
> adjacent pages. So, you can implement the trampoline table for X64.
> i386 does not support it.

i386 does not need this either. You make a PC-relative call, read the
return address into a register, and then do register-relative data
access.

either:

    call get_pc          ; PC-relative call
    mov eax, [eax+x]

    get_pc:
    mov eax, [esp]
    ret

or if you don't mind disrupting the return address predictor:

    call +0
    pop eax
    mov eax, [eax+x]

where x is computed by the static linker, and eax can vary.

The same way PIC code normally works, I think.

Also the data and code do not have to be on adjacent pages in this
scheme. You can just map an entire .dll/.so additional times. A little
wasteful, yes, but quite convenient. Factor the thunks/trampolines into
their own .so/.dll to make it not very wasteful. The functions do not
even have to be a fixed distance from their array element either.

Architectures that are "naturally" position independent (amd64, arm64)
do not even need any assembly to do this. Just use C and stamp out
multiple copies with the C preprocessor. But arm32 and x86 do tend to
need some assembly, depending on compilation model, etc. (i.e. on
Windows at least).

Is there any architecture that lacks both PC-relative data access and
PC-relative call, with ability to materialize the return address into a
register?

Given codegen that is not "arbitrary", you make it "data driven" and you
don't need kernel support, unless there really exist architectures that
cannot reasonably synthesize PC-relative data access?

As long as you can use mmap or similar to map a .so/.dll any number of
times, to produce any number of thunks. On Windows this is
CreateFileMapping(SEC_IMAGE) + MapViewOfFile, i.e.
not dlopen and not LoadLibrary, they just increment a reference count
and return the original mapping.

 - Jay
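As a rough illustration of the "use C and stamp out multiple copies with the C preprocessor" remark, here is a small data-driven sketch. Everything in it is hypothetical (`struct slot`, `DEFINE_THUNK`, `bind_thunk`, the fixed `int (*)(int)` signature); real libffi closures support arbitrary signatures, and this sketch does not cover mapping the module more than once to obtain more thunks.

```c
#include <assert.h>
#include <stddef.h>

/* Handler signature assumed for this sketch only. */
typedef int (*handler_fn)(void *ctx, int arg);
typedef int (*thunk_fn)(int arg);

/* One table entry per thunk: where to go and what context to pass. */
struct slot {
    handler_fn code;
    void *data;
};

#define NTHUNKS 4
static struct slot table[NTHUNKS];

/* Stamp out one thunk per table index. Each thunk is fixed code; only its
 * table entry changes at run time, so no code is ever generated. */
#define DEFINE_THUNK(n) \
    static int thunk_##n(int arg) { return table[n].code(table[n].data, arg); }

DEFINE_THUNK(0)
DEFINE_THUNK(1)
DEFINE_THUNK(2)
DEFINE_THUNK(3)

static const thunk_fn thunks[NTHUNKS] = { thunk_0, thunk_1, thunk_2, thunk_3 };
static size_t next_free;

/* Bind the next unused thunk to (code, data); NULL once all are taken. */
static thunk_fn bind_thunk(handler_fn code, void *data)
{
    assert(code != NULL);
    if (next_free == NTHUNKS)
        return NULL;
    table[next_free].code = code;
    table[next_free].data = data;
    return thunks[next_free++];
}

/* Example handler: adds the integer stored in ctx to the argument. */
static int add_ctx(void *ctx, int arg)
{
    return *(int *)ctx + arg;
}
```

For instance, binding `add_ctx` once with a pointer to 5 and once with a pointer to 10 yields two distinct function pointers whose calls with argument 1 give 6 and 11. On amd64 or arm64 a compiler would normally reach `table` with PC-relative addressing when building position-independent code, which is what makes the C-only approach plausible there.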
On Tue, Sep 22, 2020 at 09:46:16PM -0400, Arvind Sankar wrote: > On Thu, Sep 17, 2020 at 10:36:02AM -0500, Madhavan T. Venkataraman wrote: > > > > > > On 9/16/20 8:04 PM, Florian Weimer wrote: > > > * madvenka: > > > > > >> Examples of trampolines > > >> ======================= > > >> > > >> libffi (A Portable Foreign Function Interface Library): > > >> > > >> libffi allows a user to define functions with an arbitrary list of > > >> arguments and return value through a feature called "Closures". > > >> Closures use trampolines to jump to ABI handlers that handle calling > > >> conventions and call a target function. libffi is used by a lot > > >> of different applications. To name a few: > > >> > > >> - Python > > >> - Java > > >> - Javascript > > >> - Ruby FFI > > >> - Lisp > > >> - Objective C > > > > > > libffi does not actually need this. It currently collocates > > > trampolines and the data they need on the same page, but that's > > > actually unecessary. It's possible to avoid doing this just by > > > changing libffi, without any kernel changes. > > > > > > I think this has already been done for the iOS port. > > > > > > > The trampoline table that has been implemented for the iOS port (MACH) > > is based on PC-relative data referencing. That is, the code and data > > are placed in adjacent pages so that the code can access the data using > > an address relative to the current PC. > > > > This is an ISA feature that is not supported on all architectures. > > > > Now, if it is a performance feature, we can include some architectures > > and exclude others. But this is a security feature. IMO, we cannot > > exclude any architecture even if it is a legacy one as long as Linux > > is running on the architecture. So, we need a solution that does > > not assume any specific ISA feature. > > Which ISA does not support PIC objects? You mentioned i386 below, but > i386 does support them, it just needs to copy the PC into a GPR first > (see below). 
> > > > > >> The code for trampoline X in the trampoline table is: > > >> > > >> load &code_table[X], code_reg > > >> load (code_reg), code_reg > > >> load &data_table[X], data_reg > > >> load (data_reg), data_reg > > >> jump code_reg > > >> > > >> The addresses &code_table[X] and &data_table[X] are baked into the > > >> trampoline code. So, PC-relative data references are not needed. The user > > >> can modify code_table[X] and data_table[X] dynamically. > > > > > > You can put this code into the libffi shared object and map it from > > > there, just like the rest of the libffi code. To get more > > > trampolines, you can map the page containing the trampolines multiple > > > times, each instance preceded by a separate data page with the control > > > information. > > > > > > > If you put the code in the libffi shared object, how do you pass data to > > the code at runtime? If the code we are talking about is a function, then > > there is an ABI defined way to pass data to the function. But if the > > code we are talking about is some arbitrary code such as a trampoline, > > there is no ABI defined way to pass data to it except in a couple of > > platforms such as HP PA-RISC that have support for function descriptors > > in the ABI itself. > > > > As mentioned before, if the ISA supports PC-relative data references > > (e.g., X86 64-bit platforms support RIP-relative data references) > > then we can pass data to that code by placing the code and data in > > adjacent pages. So, you can implement the trampoline table for X64. > > i386 does not support it. > > > > i386 just needs a tiny bit of code to copy the PC into a GPR first, i.e. > the trampoline would be: > > call 1f > 1: pop %data_reg > movl (code_table + X - 1b)(%data_reg), %code_reg > movl (data_table + X - 1b)(%data_reg), %data_reg > jmp *(%code_reg) > > I do not understand the point about passing data at runtime. This > trampoline is to achieve exactly that, no? > > Thanks. 
For libffi, I think the proposed standard trampoline won't actually
work, because not all ABIs have two scratch registers available to use
as code_reg and data_reg. E.g. i386 fastcall only has one, and the
register ABI (FFI_REGISTER) has zero scratch registers. I believe
32-bit ARM only has one scratch register as well.

For i386 you'd need something that saves a register on the stack first,
maybe like the below with a 16-byte trampoline and a 16-byte context
structure that has the address of the code to jump to in the first
dword:

        .balign 4096
trampoline_page:
        .rept 4096/16-1
0:      endbr32
        push %eax
        call __x86.get_pc_thunk.ax
1:      jmp trampoline
        .balign 16
        .endr
        .org trampoline_page + 4096 - 16
__x86.get_pc_thunk.ax:
        movl (%esp), %eax
        ret
trampoline:
        subl $(1b-0b), %eax
        jmp *(table-trampoline_page)(%eax)

        .org trampoline_page + 4096
table:
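The address arithmetic implied by the page layout above can be written out in C as a sanity check. This is a sketch only: the constant `RET_OFFSET` assumes the usual encodings (4 bytes for `endbr32`, 1 for `push %eax`, 5 for a `call` with a 32-bit displacement), and the helper names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TRAMP_PAGE_SIZE = 4096,
    SLOT_SIZE       = 16,
    /* The last 16-byte slot holds the PC thunk and the common tail. */
    NUM_TRAMPOLINES = TRAMP_PAGE_SIZE / SLOT_SIZE - 1,
    /* Assumed distance from label 0 to label 1 (the return address):
     * endbr32 (4) + push %eax (1) + call rel32 (5). */
    RET_OFFSET      = 10
};

/* "subl $(1b-0b), %eax": recover the slot start from the return address. */
static uintptr_t slot_from_return_address(uintptr_t ret)
{
    return ret - RET_OFFSET;
}

/* "(table-trampoline_page)(%eax)": the 16-byte context for a slot lies at
 * the same offset in the page that follows the trampoline page. */
static uintptr_t context_for_slot(uintptr_t slot)
{
    return slot + TRAMP_PAGE_SIZE;
}

/* Index of a slot within its page. */
static unsigned slot_index(uintptr_t page, uintptr_t slot)
{
    assert((slot - page) % SLOT_SIZE == 0);
    return (unsigned)((slot - page) / SLOT_SIZE);
}
```

With a page at a hypothetical address `0x10000`, slot 3 starts at `0x10030`, its return address is `0x1003a`, and its context structure is at `0x11030`. The arithmetic also shows where the figure of 255 trampolines per page comes from: 4096 / 16 - 1.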
On 9/23/20 4:11 AM, Arvind Sankar wrote:
> For libffi, I think the proposed standard trampoline won't actually
> work, because not all ABIs have two scratch registers available to use
> as code_reg and data_reg. Eg i386 fastcall only has one, and register
> has zero scratch registers. I believe 32-bit ARM only has one scratch
> register as well.

The trampoline is invoked as a function call in the libffi case. Any
caller saved register can be used as code_reg, can it not? And the
scratch register is needed only to jump to the code. After that, it
can be reused for any other purpose.

However, for ARM, you are quite correct. There is only one scratch
register. This means that I have to provide two types of trampolines:

	- If an architecture has enough scratch registers, use the currently
	  defined trampoline.

	- If the architecture has only one scratch register, but has PC-relative
	  data references, then embed the code address at the bottom of the
	  trampoline and access it using PC-relative addressing.

Thanks for pointing this out.

Madhavan
On Wed, Sep 23, 2020 at 02:17:30PM -0500, Madhavan T. Venkataraman wrote: > > > On 9/23/20 4:11 AM, Arvind Sankar wrote: > > For libffi, I think the proposed standard trampoline won't actually > > work, because not all ABIs have two scratch registers available to use > > as code_reg and data_reg. Eg i386 fastcall only has one, and register > > has zero scratch registers. I believe 32-bit ARM only has one scratch > > register as well. > > The trampoline is invoked as a function call in the libffi case. Any > caller saved register can be used as code_reg, can it not? And the > scratch register is needed only to jump to the code. After that, it > can be reused for any other purpose. > > However, for ARM, you are quite correct. There is only one scratch > register. This means that I have to provide two types of trampolines: > > - If an architecture has enough scratch registers, use the currently > defined trampoline. > > - If the architecture has only one scratch register, but has PC-relative > data references, then embed the code address at the bottom of the > trampoline and access it using PC-relative addressing. > > Thanks for pointing this out. > > Madhavan libffi is trying to provide closures with non-standard ABIs as well: the actual user function is standard ABI, but the closure can be called with a different ABI. If the closure was created with FFI_REGISTER abi, there are no registers available for the trampoline to use: EAX, EDX and ECX contain the first three arguments of the function, and every other register is callee-save. I provided a sample of the kind of trampoline that would be needed in this case -- it's position-independent and doesn't clobber any registers at all, and you get 255 trampolines per page. If I take another 16-byte slot out of the page for the end trampoline that does the actual work, I'm sure I could even come up with one that can just call a normal C function, only the return might need special handling depending on the return type. 
And again, do you actually have any example of an architecture that cannot run position-independent code? PC-relative addressing is an implementation detail: the fact that it's available for x86_64 but not for i386 just makes position-independent code more cumbersome on i386, but it doesn't make it impossible. For the tiny trampolines here, it makes almost no difference.
On 9/23/20 2:51 PM, Arvind Sankar wrote: > On Wed, Sep 23, 2020 at 02:17:30PM -0500, Madhavan T. Venkataraman wrote: >> >> >> On 9/23/20 4:11 AM, Arvind Sankar wrote: >>> For libffi, I think the proposed standard trampoline won't actually >>> work, because not all ABIs have two scratch registers available to use >>> as code_reg and data_reg. Eg i386 fastcall only has one, and register >>> has zero scratch registers. I believe 32-bit ARM only has one scratch >>> register as well. >> >> The trampoline is invoked as a function call in the libffi case. Any >> caller saved register can be used as code_reg, can it not? And the >> scratch register is needed only to jump to the code. After that, it >> can be reused for any other purpose. >> >> However, for ARM, you are quite correct. There is only one scratch >> register. This means that I have to provide two types of trampolines: >> >> - If an architecture has enough scratch registers, use the currently >> defined trampoline. >> >> - If the architecture has only one scratch register, but has PC-relative >> data references, then embed the code address at the bottom of the >> trampoline and access it using PC-relative addressing. >> >> Thanks for pointing this out. >> >> Madhavan > > libffi is trying to provide closures with non-standard ABIs as well: the > actual user function is standard ABI, but the closure can be called with > a different ABI. If the closure was created with FFI_REGISTER abi, there > are no registers available for the trampoline to use: EAX, EDX and ECX > contain the first three arguments of the function, and every other > register is callee-save. > > I provided a sample of the kind of trampoline that would be needed in > this case -- it's position-independent and doesn't clobber any registers > at all, and you get 255 trampolines per page. 
If I take another 16-byte > slot out of the page for the end trampoline that does the actual work, > I'm sure I could even come up with one that can just call a normal C > function, only the return might need special handling depending on the > return type. > > And again, do you actually have any example of an architecture that > cannot run position-independent code? PC-relative addressing is an > implementation detail: the fact that it's available for x86_64 but not > for i386 just makes position-independent code more cumbersome on i386, > but it doesn't make it impossible. For the tiny trampolines here, it > makes almost no difference. > Hi Arvind, I am preparing a response for all of your comments. I will send it out tomorrow. Sorry for the delay. Madhavan
On 9/23/20 2:51 PM, Arvind Sankar wrote: > On Wed, Sep 23, 2020 at 02:17:30PM -0500, Madhavan T. Venkataraman wrote: >> >> >> On 9/23/20 4:11 AM, Arvind Sankar wrote: >>> For libffi, I think the proposed standard trampoline won't actually >>> work, because not all ABIs have two scratch registers available to use >>> as code_reg and data_reg. Eg i386 fastcall only has one, and register >>> has zero scratch registers. I believe 32-bit ARM only has one scratch >>> register as well. >> >> The trampoline is invoked as a function call in the libffi case. Any >> caller saved register can be used as code_reg, can it not? And the >> scratch register is needed only to jump to the code. After that, it >> can be reused for any other purpose. >> >> However, for ARM, you are quite correct. There is only one scratch >> register. This means that I have to provide two types of trampolines: >> >> - If an architecture has enough scratch registers, use the currently >> defined trampoline. >> >> - If the architecture has only one scratch register, but has PC-relative >> data references, then embed the code address at the bottom of the >> trampoline and access it using PC-relative addressing. >> >> Thanks for pointing this out. >> >> Madhavan > > libffi is trying to provide closures with non-standard ABIs as well: the > actual user function is standard ABI, but the closure can be called with > a different ABI. If the closure was created with FFI_REGISTER abi, there > are no registers available for the trampoline to use: EAX, EDX and ECX > contain the first three arguments of the function, and every other > register is callee-save. > > I provided a sample of the kind of trampoline that would be needed in > this case -- it's position-independent and doesn't clobber any registers > at all, and you get 255 trampolines per page. 
If I take another 16-byte > slot out of the page for the end trampoline that does the actual work, > I'm sure I could even come up with one that can just call a normal C > function, only the return might need special handling depending on the > return type. > > And again, do you actually have any example of an architecture that > cannot run position-independent code? PC-relative addressing is an > implementation detail: the fact that it's available for x86_64 but not > for i386 just makes position-independent code more cumbersome on i386, > but it doesn't make it impossible. For the tiny trampolines here, it > makes almost no difference. > I have tried to answer all of your previous comments here. Let me know if I missed anything: > Which ISA does not support PIC objects? You mentioned i386 below, but > i386 does support them, it just needs to copy the PC into a GPR first > (see below). Position Independent Code needs PC-relative branches. I was referring to PC-relative data references. Like RIP-relative data references in X64. i386 ISA does not support this. > i386 just needs a tiny bit of code to copy the PC into a GPR first, i.e. > the trampoline would be: > > call 1f > 1: pop %data_reg > movl (code_table + X - 1b)(%data_reg), %code_reg > movl (data_table + X - 1b)(%data_reg), %data_reg > jmp *(%code_reg) > > I do not understand the point about passing data at runtime. This > trampoline is to achieve exactly that, no? PC-relative data referencing ---------------------------- I agree that the current PC value can be loaded in a GPR using the trick of call, pop on i386. Perhaps, on other architectures, we can do similar things. For instance, in architectures that load the return address in a designated register instead of pushing it on the stack, the trampoline could call a leaf function that moves the value of that register into data_reg so that at the location after the call instruction, the current PC is already loaded in data_reg. 
SPARC is one example I can think of.

My take is - if the ISA supports PC-relative data referencing explicitly (like
X64 or ARM64), then we can use it. Or, if the ABI specification documents an
approved way to load the PC into a GPR, we can use it.

Otherwise, using an ABI quirk or a calling convention side effect to load the
PC into a GPR is, IMO, non-standard or non-compliant or non-approved or
whatever you want to call it. I would be conservative and not use it. Who knows
what incompatibility there will be with some future software or hardware
features?

For instance, in the i386 example, we do a call without a matching return.
Also, we use a pop to undo the call. Can anyone tell me if this kind of use
is an ABI approved one?

Kernel supplied trampoline
--------------------------

One advantage in doing this in the kernel is that we don't need to use
non-standard or non-ABI compliant code.

To minimize the number of registers used by the trampoline, I will redefine
the kernel generated trampoline as follows:

- The kernel loads the trampoline and the code and the data addresses to be
  dereferenced like this:

	A ---->	-------------------
		| Trampoline code |
	B ---->	-------------------
		| Data Address    |
		-------------------
		| Code Address    |
		-------------------

So, the trampoline code would be:

	mov	B, %data_reg
	jump	(%data_reg + sizeof(Data address))

The kernel will hard code B into the trampoline.

The static code that the trampoline jumps to looks like this:

	load	(%data_reg), %data_reg
	rest of the code

Use of scratch registers
------------------------

With this new trampoline, we only use one scratch register. So, the same RFC
will work for libffi on ARM.

You pointed out that in the FFI_REGISTER ABI no scratch registers can be used.
Read the section "Secure vs Performant trampoline" below where this is
addressed.

Standard API for all userland for all architectures
---------------------------------------------------

The next advantage in using the kernel is standardization.
If the kernel supplies this, then all applications and libraries can use
it for all architectures with one single, simple API. Without this, each
application/library has to roll its own solution for every architecture-ABI
combo it wants to support.

Furthermore, if this work gets accepted, I plan to add a glibc wrapper for
the kernel API. The glibc API would look something like this:

	Allocate a trampoline
	---------------------

	tramp = alloc_tramp();

	Set trampoline parameters
	-------------------------

	init_tramp(tramp, code, data);

	Free the trampoline
	-------------------

	free_tramp(tramp);

glibc will allocate and manage the code and data tables, handle kernel API
details and manage the trampoline table.

As an example, in libffi:

	ffi_closure_alloc() would call alloc_tramp()

	ffi_prep_closure_loc() would call init_tramp()

	ffi_closure_free() would call free_tramp()

That is it! It works on all the architectures supported in the kernel for
trampfd.

This makes it really easy for maintainers to adopt the API and move their
code to a more secure model (which is the fundamental idea behind this work).
For this advantage alone, IMO, it is worth doing it in the kernel.

Secure vs Performant trampoline
-------------------------------

If you recall, in version 1, I presented a trampoline type that is
implemented in the kernel. When an application invokes the trampoline,
it traps into the kernel and the kernel performs the work of the trampoline.

The disadvantage is that a trip to the kernel is needed. That can be
expensive.

The advantage is that the kernel can add security checks before doing the
work. Mainly, I am looking at checks that might prevent the trampoline
from being used in an ROP/BOP chain.
Some half-baked ideas: - Check that the invocation is at the starting point of the trampoline - Check if the trampoline is jumping to an allowed PC - Check if the trampoline is being invoked from an allowed calling PC or PC range Allowed PCs can be input using the trampfd API mentioned in version 1. Basically, an array of PCs is written into trampfd. Suggestions for other checks are most welcome! I would like to implement an option in the trampfd API. The user can choose a secure trampoline or a performant trampoline. For a performant trampoline, the kernel will generate the code. For a secure trampoline, the kernel will do the work itself. In order to address the FFI_REGISTER ABI in libffi, we could use the secure trampoline. In FFI_REGISTER, the data is pushed on the stack and the code is jumped to without using any registers. As outlined in version 1, the kernel can push the data address on the stack and write the code address into the PC and return to userland. For doing all of this, we need trampfd. Permitting the use of trampfd ----------------------------- An "exectramp" setting can be implemented in SELinux to selectively allow the use of trampfd for applications. Madhavan
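For readers following the API discussion, here is a minimal userland sketch of the three calls named above (alloc_tramp, init_tramp, free_tramp). It is only a model of the bookkeeping: no machine code is generated, and struct tramp, TRAMP_MAX, tramp_code_fn and tramp_invoke() are hypothetical names that do not appear in the patchset. tramp_invoke() stands in for what executing a real trampoline would do, that is, load the data pointer and jump to the code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: signature of the static code a trampoline jumps to. */
typedef int (*tramp_code_fn)(void *data);

/* Hypothetical slot: models one code_table[X] / data_table[X] pair. */
struct tramp {
    tramp_code_fn code;
    void *data;
    int in_use;
};

#define TRAMP_MAX 16                      /* arbitrary size for the sketch */
static struct tramp tramp_table[TRAMP_MAX];

/* Hand out the first free slot, or NULL if the table is full. */
struct tramp *alloc_tramp(void)
{
    for (size_t i = 0; i < TRAMP_MAX; i++) {
        if (!tramp_table[i].in_use) {
            tramp_table[i].in_use = 1;
            tramp_table[i].code = NULL;
            tramp_table[i].data = NULL;
            return &tramp_table[i];
        }
    }
    return NULL;
}

/* Set the code and data pointers; returns 0 on success, -1 on error. */
int init_tramp(struct tramp *t, tramp_code_fn code, void *data)
{
    if (t == NULL || !t->in_use || code == NULL)
        return -1;
    t->code = code;
    t->data = data;
    return 0;
}

/* Return the slot to the table. */
void free_tramp(struct tramp *t)
{
    if (t != NULL)
        t->in_use = 0;
}

/* Stand-in for executing the trampoline: pass data, jump to code. */
int tramp_invoke(const struct tramp *t)
{
    assert(t != NULL && t->in_use && t->code != NULL);
    return t->code(t->data);
}

/* Example target, playing the role of an ABI handler. */
static int add_one(void *data)
{
    return *(int *)data + 1;
}
```

In this model, a libffi-style caller would pair alloc_tramp() with its closure allocation, init_tramp() with closure preparation, and free_tramp() with closure release, as listed in the message above.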
* Madhavan T. Venkataraman: > Otherwise, using an ABI quirk or a calling convention side effect to > load the PC into a GPR is, IMO, non-standard or non-compliant or > non-approved or whatever you want to call it. I would be > conservative and not use it. Who knows what incompatibility there > will be with some future software or hardware features? AArch64 PAC makes a backwards-incompatible change that touches this area, but we'll see if they can actually get away with it. In general, these things are baked into the ABI, even if they are not spelled out explicitly in the psABI supplement. > For instance, in the i386 example, we do a call without a matching return. > Also, we use a pop to undo the call. Can anyone tell me if this kind of use > is an ABI approved one? Yes, for i386, this is completely valid from an ABI point of view. It's equally possible to use a regular function call and just read the return address that has been pushed to the stack. Then there's no stack mismatch at all. Return stack predictors (including the one used by SHSTK) also recognize the CALL 0 construct, so that's fine as well. The i386 psABI does not use function descriptors, and either approach (out-of-line thunk or CALL 0) is in common use to materialize the program counter in a register and construct the GOT pointer. > If the kernel supplies this, then all applications and libraries can use > it for all architectures with one single, simple API. Without this, each > application/library has to roll its own solution for every architecture-ABI > combo it wants to support. Is there any other user for these type-generic trampolines? Everything else I've seen generates machine code specific to the function being called. libffi is quite the outlier in my experience because the trampoline calls a generic data-driven marshaller/unmarshaller. The other trampoline generators put this marshalling code directly into the generated trampoline. 
I'm still not convinced that this can't be done directly in libffi, without kernel help. Hiding the architecture-specific code in the kernel doesn't reduce overall system complexity. > As an example, in libffi: > > ffi_closure_alloc() would call alloc_tramp() > > ffi_prep_closure_loc() would call init_tramp() > > ffi_closure_free() would call free_tramp() > > That is it! It works on all the architectures supported in the kernel for > trampfd. ffi_prep_closure_loc would still need to check whether the trampoline has been allocated by alloc_tramp because some applications supply their own (executable and writable) mapping. ffi_closure_alloc would need to support different sizes (not matching the trampoline). It's also unclear to me to what extent software out there writes to the trampoline data directly, bypassing the libffi API (the structs are not opaque, after all). And all the existing libffi memory management code (including the embedded dlmalloc copy) would be needed to support kernels without trampfd for years to come. I very much agree that we have a gap in libffi when it comes to JIT-less operation. But I'm not convinced that kernel support is needed to close it, or that it is even the right design.
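As an illustration of the "regular function call that reads the return address" approach mentioned above, here is a rough C sketch. It assumes GCC or Clang, since it relies on the __builtin_return_address builtin and the noinline attribute; current_pc(), data_table and load_pc_relative() are hypothetical names. In real position-independent code the offset would be a link-time constant; here it is computed at run time purely to show the "PC plus offset" addressing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Out-of-line thunk: returns the address of the instruction that follows
 * the call, i.e. the caller's PC at that point. noinline keeps this a
 * real call so that a return address exists.
 */
__attribute__((noinline)) void *current_pc(void)
{
    return __builtin_return_address(0);
}

/* Hypothetical data that the code wants to reach relative to the PC. */
static int data_table[4] = { 10, 20, 30, 40 };

int load_pc_relative(int index)
{
    uintptr_t pc = (uintptr_t)current_pc();

    /* Illustration only: a linker would normally supply this constant. */
    uintptr_t offset = (uintptr_t)data_table - pc;

    /* PC-relative reference: base = PC + offset. */
    const int *base = (const int *)(pc + offset);
    return base[index];
}
```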
Hi! > PC-relative data referencing > ---------------------------- > > I agree that the current PC value can be loaded in a GPR using the trick > of call, pop on i386. > > Perhaps, on other architectures, we can do similar things. For instance, > in architectures that load the return address in a designated register > instead of pushing it on the stack, the trampoline could call a leaf function > that moves the value of that register into data_reg so that at the location > after the call instruction, the current PC is already loaded in data_reg. > SPARC is one example I can think of. > > My take is - if the ISA supports PC-relative data referencing explicitly (like > X64 or ARM64), then we can use it. Or, if the ABI specification documents an > approved way to load the PC into a GPR, we can use it. > > Otherwise, using an ABI quirk or a calling convention side effect to load the > PC into a GPR is, IMO, non-standard or non-compliant or non-approved or > whatever you want to call it. I would be conservative and not use ISAs are very well defined, and basically not changing. If you want to argue we should not use something, you should have very clear picture _why_ it is bad. "Non-standard or non-approved or whatever" just does not cut it. And yes, certain tricks may be seriously slow on modern CPUs, and we might want to avoid those. But other than that... you should have better argument than "it is non-standard". Best regards, Pavel
On Thu, Sep 24, 2020 at 03:23:52PM -0500, Madhavan T. Venkataraman wrote: > > > > Which ISA does not support PIC objects? You mentioned i386 below, but > > i386 does support them, it just needs to copy the PC into a GPR first > > (see below). > > Position Independent Code needs PC-relative branches. I was referring > to PC-relative data references. Like RIP-relative data references in > X64. i386 ISA does not support this. I was talking about PC-relative data references too: they are a requirement for PIC code that wants to access any global data. They can be implemented easily on i386 even though it doesn't have an addressing mode that uses the PC. > Otherwise, using an ABI quirk or a calling convention side effect to load the > PC into a GPR is, IMO, non-standard or non-compliant or non-approved or > whatever you want to call it. I would be conservative and not use it. Who knows > what incompatibility there will be with some future software or hardware > features? > > For instance, in the i386 example, we do a call without a matching return. > Also, we use a pop to undo the call. Can anyone tell me if this kind of use > is an ABI approved one? This doesn't have anything to do with the ABI, since what happened here isn't visible to any caller or callee. Any machine instruction sequence that has the effect of copying the PC into a GPR is acceptable, but this is basically the only possible solution on i386. If you don't like the call/pop mismatch (though that's supported by the hardware, and is what clang likes to use), you can use the slightly different technique used in my example, which copies the top of stack into a GPR after a call. This is how all i386 PIC code has always worked. > Standard API for all userland for all architectures > --------------------------------------------------- > > The next advantage in using the kernel is standardization. 
> > If the kernel supplies this, then all applications and libraries can use > it for all architectures with one single, simple API. Without this, each > application/library has to roll its own solution for every architecture-ABI > combo it wants to support. But you can get even more standardization out of a userspace library, because that can work even on non-linux OS's, as well as versions of linux where the new syscall isn't available. > > Furthermore, if this work gets accepted, I plan to add a glibc wrapper for > the kernel API. The glibc API would look something like this: > > Allocate a trampoline > --------------------- > > tramp = alloc_tramp(); > > Set trampoline parameters > ------------------------- > > init_tramp(tramp, code, data); > > Free the trampoline > ------------------- > > free_tramp(tramp); > > glibc will allocate and manage the code and data tables, handle kernel API > details and manage the trampoline table. glibc could do this already if it wants, even without the syscall, because this can be done in userspace already. > > Secure vs Performant trampoline > ------------------------------- > > If you recall, in version 1, I presented a trampoline type that is > implemented in the kernel. When an application invokes the trampoline, > it traps into the kernel and the kernel performs the work of the trampoline. > > The disadvantage is that a trip to the kernel is needed. That can be > expensive. > > The advantage is that the kernel can add security checks before doing the > work. Mainly, I am looking at checks that might prevent the trampoline > from being used in an ROP/BOP chain. Some half-baked ideas: > > - Check that the invocation is at the starting point of the > trampoline > > - Check if the trampoline is jumping to an allowed PC > > - Check if the trampoline is being invoked from an allowed > calling PC or PC range > > Allowed PCs can be input using the trampfd API mentioned in version 1. 
> Basically, an array of PCs is written into trampfd. The source PC will generally not be available if the compiler decided to tail-call optimize the call to the trampoline into a jump. What's special about these trampolines anyway? Any indirect function call could have these same problems -- an attacker could have overwritten the pointer the same way, whether it's supposed to point to a normal function or it is the target of this trampoline. For making them a bit safer, userspace could just map the page holding the data pointers/destination address(es) as read-only after initialization. > > Suggestions for other checks are most welcome! > > I would like to implement an option in the trampfd API. The user can > choose a secure trampoline or a performant trampoline. For a performant > trampoline, the kernel will generate the code. For a secure trampoline, > the kernel will do the work itself. > > In order to address the FFI_REGISTER ABI in libffi, we could use the secure > trampoline. In FFI_REGISTER, the data is pushed on the stack and the code > is jumped to without using any registers. > > As outlined in version 1, the kernel can push the data address on the stack > and write the code address into the PC and return to userland. > > For doing all of this, we need trampfd. We don't need this for FFI_REGISTER. I presented a solution that works in userspace. Even if you want to use a trampoline created by the kernel, there's no reason it needs to trap into the kernel at trampoline execution time. libffi's trampolines already handle this case today. > > Permitting the use of trampfd > ----------------------------- > > An "exectramp" setting can be implemented in SELinux to selectively allow the > use of trampfd for applications. > > Madhavan Applications can use their own userspace trampolines regardless of this setting, so it doesn't provide any additional security benefit by preventing usage of trampfd.
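The read-only-after-initialization suggestion above could look roughly like the following on a POSIX system. This is only a sketch: struct tramp_data, tramp_data_page_create() and tramp_data_page_seal() are hypothetical names, and a single anonymous page is assumed to hold the code/data pointer pairs.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical layout of one entry in the data page. */
struct tramp_data {
    void *code;   /* destination address */
    void *data;   /* data pointer handed to the destination */
};

/* Map one writable, non-executable page for the pointer table. */
struct tramp_data *tramp_data_page_create(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* After initialization, drop write permission on the page. */
int tramp_data_page_seal(struct tramp_data *page)
{
    return mprotect(page, (size_t)sysconf(_SC_PAGESIZE), PROT_READ);
}
```

Once sealed, a stray write to the table faults instead of silently redirecting the trampoline. An attacker who can call mprotect() can of course undo this, so it only raises the bar.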
On 9/24/20 3:52 PM, Florian Weimer wrote: > * Madhavan T. Venkataraman: > >> Otherwise, using an ABI quirk or a calling convention side effect to >> load the PC into a GPR is, IMO, non-standard or non-compliant or >> non-approved or whatever you want to call it. I would be >> conservative and not use it. Who knows what incompatibility there >> will be with some future software or hardware features? > > AArch64 PAC makes a backwards-incompatible change that touches this > area, but we'll see if they can actually get away with it. > > In general, these things are baked into the ABI, even if they are not > spelled out explicitly in the psABI supplement. > >> For instance, in the i386 example, we do a call without a matching return. >> Also, we use a pop to undo the call. Can anyone tell me if this kind of use >> is an ABI approved one? > > Yes, for i386, this is completely valid from an ABI point of view. > It's equally possible to use a regular function call and just read the > return address that has been pushed to the stack. Then there's no > stack mismatch at all. Return stack predictors (including the one > used by SHSTK) also recognize the CALL 0 construct, so that's fine as > well. The i386 psABI does not use function descriptors, and either > approach (out-of-line thunk or CALL 0) is in common use to materialize > the program counter in a register and construct the GOT pointer. > >> If the kernel supplies this, then all applications and libraries can use >> it for all architectures with one single, simple API. Without this, each >> application/library has to roll its own solution for every architecture-ABI >> combo it wants to support. > > Is there any other user for these type-generic trampolines? > Everything else I've seen generates machine code specific to the > function being called. libffi is quite the outlier in my experience > because the trampoline calls a generic data-driven > marshaller/unmarshaller. 
The other trampoline generators put this > marshalling code directly into the generated trampoline. > > I'm still not convinced that this can't be done directly in libffi, > without kernel help. Hiding the architecture-specific code in the > kernel doesn't reduce overall system complexity. > See below. I have accepted the community's recommendation to implement it in user land. However, this is not just for libffi. It is for all dynamic code. libffi is just the first use case I am addressing with this. >> As an example, in libffi: >> >> ffi_closure_alloc() would call alloc_tramp() >> >> ffi_prep_closure_loc() would call init_tramp() >> >> ffi_closure_free() would call free_tramp() >> >> That is it! It works on all the architectures supported in the kernel for >> trampfd. > > ffi_prep_closure_loc would still need to check whether the trampoline > has been allocated by alloc_tramp because some applications supply > their own (executable and writable) mapping. ffi_closure_alloc would > need to support different sizes (not matching the trampoline). It's > also unclear to me to what extent software out there writes to the > trampoline data directly, bypassing the libffi API (the structs are > not opaque, after all). And all the existing libffi memory management > code (including the embedded dlmalloc copy) would be needed to support > kernels without trampfd for years to come. > In the libffi patch I have included, I have handled this. The closure structure contains a tramp field: char tramp[FFI_TRAMPOLINE_SIZE]; If trampfd is not used, this array will contain the actual trampoline code. If trampfd is used, then we don't need the array for storing any trampoline code. That space can be used for storing trampfd related information. So, there is no change to the closure structure. Also, the code can tell if the closure has been allocated from dlmalloc() called from ffi_closure_alloc() or has been allocated by the caller directly without calling ffi_closure_alloc(). 
I have written this function:

	int ffi_closure_alloc_called(void *closure)
	{
		msegmentptr seg = segment_holding (gm, closure);
		return (seg != NULL);
	}

Using this function, I can tell how the closure has been allocated. I use
trampfd only for closures that have been allocated using ffi_closure_alloc().

So, I believe I have handled all the cases. If I have missed anything, let me
know. I will address it.

> I very much agree that we have a gap in libffi when it comes to
> JIT-less operation. But I'm not convinced that kernel support is
> needed to close it, or that it is even the right design.
>

I have taken into account most of the comments received so far and I have
come up with a proposal:

I would like to do this in two separate RFCs:

library RFC
-----------

I accept the recommendation of the reviewers about implementing it in user
land in a library. Just for the sake of context, I would like to reiterate the
problem being solved and what the library will contain. Bear with me.

My goal is to help convert existing dynamic code to static code as far as
possible. The binary generated from the static code can be signed. The kernel
can use signature verification to authenticate the code. This way, we don't
need to disable W^X or make exceptions for the code (execmem etc) or use any
user level methods to somehow map and execute the code.

The dynamic code can be very simple like the libffi trampoline. Or, it can be
a lot more complex. E.g., a trampoline that uses data marshaling as Florian
mentioned.

In all cases, when the code is converted to static code, the static code needs
to know where its data will be located at runtime. If static code is a
function, then one can just pass parameters. But if it is arbitrary code, then
one needs a way to inform the static code where it can find its data. The code
can use PC-relative referencing where available.
For the sake of this discussion, let us assume that we can use some trick or the other to load the current PC into a GPR on all architectures. Then, we can use PC-relative referencing. Let us assume that these tricks will not cause ABI compliance issues in the future. The maintainer of the dynamic code who wishes to convert it to static code should not have to deal with all of these details. The static code should be able to assume that its data is pointed to by a designated register. Or, it should be able to assume that the data pointer has been pushed on the stack. Then, it is easier for maintainers to adopt this and move their code to a more secure model. This can be achieved by providing a small, minimal trampoline that loads the data pointer in a register or pushes it on the stack and jumps to the static code. The reviewers felt that the minimal trampoline can be provided in user land. So, I will provide a user library. The user library will: - define the minimal trampoline statically for different architectures using some flavor of PC-relative data referencing - provide a table of trampolines in a page - create and manage code and data pages - present a simple API to dynamic code maintainers This overall approach has pretty much been agreed upon by the community so far. I will send out an RFC for the library once I have the code ready. Which library? -------------- I need a recommendation from the community on this. Should I just place the code in glibc? Or, should I create a libtramp for this? I prefer glibc as it will make for easier adoption. But I will defer to the community on this. What do you recommend? trampfd RFC version 3 --------------------- Once the library RFC is accepted, I would, however, like to submit version 3 of trampfd. 
The library would support a choice of trampoline: - fast user trampoline described above - slow kernel trampoline described below that supports security checks each time the trampoline is invoked The minimal trampoline mentioned above would also be implemented in the kernel. The mechanism is outlined in version 1. When the application executes the trampoline, it would trap into the kernel and the kernel would do the work (load the data pointer in a user register or push it on the user stack and set the user PC to the target code and return). The kernel will perform security checks when the trampoline is invoked. For instance, to reduce or eliminate the possibility of the trampoline being used in an ROP/BOP chain. The checks are work in progress. But I think I can nail them. Note that there is no code generation involved in this proposal. The kernel is the trampoline. Would you guys be willing to consider this approach? Madhavan
On 9/24/20 6:43 PM, Arvind Sankar wrote: > On Thu, Sep 24, 2020 at 03:23:52PM -0500, Madhavan T. Venkataraman wrote: >> >> >>> Which ISA does not support PIC objects? You mentioned i386 below, but >>> i386 does support them, it just needs to copy the PC into a GPR first >>> (see below). >> >> Position Independent Code needs PC-relative branches. I was referring >> to PC-relative data references. Like RIP-relative data references in >> X64. i386 ISA does not support this. > > I was talking about PC-relative data references too: they are a > requirement for PIC code that wants to access any global data. They can > be implemented easily on i386 even though it doesn't have an addressing > mode that uses the PC. > >> Otherwise, using an ABI quirk or a calling convention side effect to load the >> PC into a GPR is, IMO, non-standard or non-compliant or non-approved or >> whatever you want to call it. I would be conservative and not use it. Who knows >> what incompatibility there will be with some future software or hardware >> features? >> >> For instance, in the i386 example, we do a call without a matching return. >> Also, we use a pop to undo the call. Can anyone tell me if this kind of use >> is an ABI approved one? > > This doesn't have anything to do with the ABI, since what happened here > isn't visible to any caller or callee. Any machine instruction sequence > that has the effect of copying the PC into a GPR is acceptable, but this > is basically the only possible solution on i386. If you don't like the > call/pop mismatch (though that's supported by the hardware, and is what > clang likes to use), you can use the slightly different technique used > in my example, which copies the top of stack into a GPR after a call. > > This is how all i386 PIC code has always worked. > I have responded to this in my reply to Florian. Basically, I accept the opinion of the reviewers. 
I will assume that any trick we use to get the current PC into a GPR will not cause ABI compliance issue in the future. >> Standard API for all userland for all architectures >> --------------------------------------------------- >> >> The next advantage in using the kernel is standardization. >> >> If the kernel supplies this, then all applications and libraries can use >> it for all architectures with one single, simple API. Without this, each >> application/library has to roll its own solution for every architecture-ABI >> combo it wants to support. > > But you can get even more standardization out of a userspace library, > because that can work even on non-linux OS's, as well as versions of > linux where the new syscall isn't available. > Dealing with old vs new kernels is the same as dealing with old vs new libs. In any case, what you have suggested above has already been suggested before and I have accepted everyone's opinion. Please see my response to Florian's email. >> >> Furthermore, if this work gets accepted, I plan to add a glibc wrapper for >> the kernel API. The glibc API would look something like this: >> >> Allocate a trampoline >> --------------------- >> >> tramp = alloc_tramp(); >> >> Set trampoline parameters >> ------------------------- >> >> init_tramp(tramp, code, data); >> >> Free the trampoline >> ------------------- >> >> free_tramp(tramp); >> >> glibc will allocate and manage the code and data tables, handle kernel API >> details and manage the trampoline table. > > glibc could do this already if it wants, even without the syscall, > because this can be done in userspace already. > I am wary of using ABI tricks or calling convention side-effects. However, since the reviewers feel it is OK, I have accepted that opinion. I have assumed now that any trick to load the current PC into a GPR can be used without any risk. I hope that assumption is correct. 
>>
>> Secure vs Performant trampoline
>> -------------------------------
>>
>> If you recall, in version 1, I presented a trampoline type that is
>> implemented in the kernel. When an application invokes the trampoline,
>> it traps into the kernel and the kernel performs the work of the trampoline.
>>
>> The disadvantage is that a trip to the kernel is needed. That can be
>> expensive.
>>
>> The advantage is that the kernel can add security checks before doing the
>> work. Mainly, I am looking at checks that might prevent the trampoline
>> from being used in an ROP/BOP chain. Some half-baked ideas:
>>
>>     - Check that the invocation is at the starting point of the
>>       trampoline
>>
>>     - Check if the trampoline is jumping to an allowed PC
>>
>>     - Check if the trampoline is being invoked from an allowed
>>       calling PC or PC range
>>
>> Allowed PCs can be input using the trampfd API mentioned in version 1.
>> Basically, an array of PCs is written into trampfd.
>
> The source PC will generally not be available if the compiler decided to
> tail-call optimize the call to the trampoline into a jump.
>

This is still work in progress. But I am thinking that labels can be used.
So, if the code is:

    invoke_tramp:
        (*tramp)();

then, invoke_tramp can be supplied as the calling PC.

Similarly, labels can be used in assembly functions as well.

Like I said, I have to think about this more.

> What's special about these trampolines anyway? Any indirect function
> call could have these same problems -- an attacker could have
> overwritten the pointer the same way, whether it's supposed to point to
> a normal function or it is the target of this trampoline.
>
> For making them a bit safer, userspace could just map the page holding
> the data pointers/destination address(es) as read-only after
> initialization.
>

You need to look at version 1 of trampfd for how to do "allowed pcs".
As an example, libffi defines ABI handlers for every arch-ABI combo.
These ABI handler pointers could be placed in an array in .rodata.
Then, the array can be written into trampfd for setting allowed PCS.
When the target PC is set for a trampoline, the kernel will check
it against allowed PCs and reject it if it has been overwritten.

>>
>> Suggestions for other checks are most welcome!
>>
>> I would like to implement an option in the trampfd API. The user can
>> choose a secure trampoline or a performant trampoline. For a performant
>> trampoline, the kernel will generate the code. For a secure trampoline,
>> the kernel will do the work itself.
>>
>> In order to address the FFI_REGISTER ABI in libffi, we could use the secure
>> trampoline. In FFI_REGISTER, the data is pushed on the stack and the code
>> is jumped to without using any registers.
>>
>> As outlined in version 1, the kernel can push the data address on the stack
>> and write the code address into the PC and return to userland.
>>
>> For doing all of this, we need trampfd.
>
> We don't need this for FFI_REGISTER. I presented a solution that works
> in userspace. Even if you want to use a trampoline created by the
> kernel, there's no reason it needs to trap into the kernel at trampoline
> execution time. libffi's trampolines already handle this case today.
>

libffi handles this using user level dynamic code which needs to be executed.
If the security subsystem prevents that, then the dynamic code cannot execute.
That is the whole point of this RFC.

>>
>> Permitting the use of trampfd
>> -----------------------------
>>
>> An "exectramp" setting can be implemented in SELinux to selectively allow the
>> use of trampfd for applications.
>>
>> Madhavan
>
> Applications can use their own userspace trampolines regardless of this
> setting, so it doesn't provide any additional security benefit by
> preventing usage of trampfd.
>

The background for all of this is that dynamic code such as trampolines
need to be placed in a page with executable permissions so they can
execute. If security measures such as W^X are present, this will not
be possible. Admitted, today some user level tricks exist to get around
W^X. I have alluded to those. IMO, they are all security holes and will
get plugged sooner or later. Then, these trampolines cannot execute.
Currently, there exist security exceptions such as execmem to let them
execute. But we would like to do it without making security exceptions.

Madhavan
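
The "allowed PCs" check described in the reply above can be modeled in user
space. What follows is only an illustrative sketch: pc_is_allowed() and
set_target_pc() are hypothetical names, not the trampfd API, and the real check
would be made by the kernel. It shows a const table of handler addresses and a
setter that rejects any target that is not in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*handler_fn)(void);

/* Stand-in ABI handlers (hypothetical); they return distinct values so
 * that they are guaranteed to be distinct functions. */
static int handler_abi_a(void) { return 1; }
static int handler_abi_b(void) { return 2; }
static int not_a_handler(void) { return 3; }

/* const, so the toolchain can place the table in read-only data. */
static const handler_fn allowed_pcs[] = { handler_abi_a, handler_abi_b };
#define N_ALLOWED (sizeof(allowed_pcs) / sizeof(allowed_pcs[0]))

/* Return true if pc is one of the registered handler addresses. */
static bool pc_is_allowed(handler_fn pc)
{
    for (size_t i = 0; i < N_ALLOWED; i++)
        if (allowed_pcs[i] == pc)
            return true;
    return false;
}

/* Model of setting a trampoline target: 0 on success, -1 if the target
 * is not in the allowed list. The slot is left untouched on rejection. */
static int set_target_pc(handler_fn *slot, handler_fn pc)
{
    if (!pc_is_allowed(pc))
        return -1;
    *slot = pc;
    return 0;
}
```

A caller would keep one slot per trampoline and treat a -1 return as a sign
that the requested target was overwritten or was never registered.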
On Fri, Sep 25, 2020 at 05:44:56PM -0500, Madhavan T. Venkataraman wrote:
>
>
> On 9/24/20 6:43 PM, Arvind Sankar wrote:
> >
> > The source PC will generally not be available if the compiler decided to
> > tail-call optimize the call to the trampoline into a jump.
> >
>
> This is still work in progress. But I am thinking that labels can be used.
> So, if the code is:
>
>     invoke_tramp:
>         (*tramp)();
>
> then, invoke_tramp can be supplied as the calling PC.
>
> Similarly, labels can be used in assembly functions as well.
>
> Like I said, I have to think about this more.

What I mean is that the kernel won't have access to the actual source
PC. If I followed your v1 correctly, it works by making any branch to
the trampoline code trigger a page fault. At this point, the PC has
already been updated to the trampoline entry, so the only thing the
fault handler can know is the return address on the top of the stack,
which (a) might not be where the branch actually originated, either
because it was a jump, or you've already been hacked and you got here
using a ret; (b) is available to userspace anyway.

>
> > What's special about these trampolines anyway? Any indirect function
> > call could have these same problems -- an attacker could have
> > overwritten the pointer the same way, whether it's supposed to point to
> > a normal function or it is the target of this trampoline.
> >
> > For making them a bit safer, userspace could just map the page holding
> > the data pointers/destination address(es) as read-only after
> > initialization.
> >
>
> You need to look at version 1 of trampfd for how to do "allowed pcs".
> As an example, libffi defines ABI handlers for every arch-ABI combo.
> These ABI handler pointers could be placed in an array in .rodata.
> Then, the array can be written into trampfd for setting allowed PCS.
> When the target PC is set for a trampoline, the kernel will check
> it against allowed PCs and reject it if it has been overwritten.

I'm not asking how it's implemented. I'm asking what's the point? On a
typical linux system, at least on x86, every library function call is an
indirect branch. The protection they get is that the dynamic linker can
map the pointer table read-only after initializing it.

For the RO mapping, libffi could be mapping both the entire closure
structure, as well as the structure that describes the arguments and
return types of the function, read-only once they are initialized.

For libffi, there are three indirect branches for every trampoline call
with your suggested trampoline: one to get to the trampoline, one to
jump to the handler, and one to call the actual user function. If we are
particularly concerned about the trampoline to handler branch for some
reason, we could just replace it with a direct branch: if the kernel was
generating the code, there's no reason to allow the data pointer or code
target to be changed after the trampoline was created. It can just
hard-code them in the generated code and be done with it. Even with
user-space trampolines, you can use a direct call. All you need is
libffi-trampoline.so which contains a few thousand trampolines all
jumping to one handler, which then decides what to do based on which
trampoline was called. Sure libffi currently dispatches to one of 2-3
handlers based on the ABI, but there's no technical reason it couldn't
dispatch to just one that handled all the ABIs, and the trampoline could
be boiled down to just:

    endbr
    call handler
    ret

> >>
> >> In order to address the FFI_REGISTER ABI in libffi, we could use the secure
> >> trampoline. In FFI_REGISTER, the data is pushed on the stack and the code
> >> is jumped to without using any registers.
> >>
> >> As outlined in version 1, the kernel can push the data address on the stack
> >> and write the code address into the PC and return to userland.
> >>
> >> For doing all of this, we need trampfd.
> >
> > We don't need this for FFI_REGISTER. I presented a solution that works
> > in userspace. Even if you want to use a trampoline created by the
> > kernel, there's no reason it needs to trap into the kernel at trampoline
> > execution time. libffi's trampolines already handle this case today.
> >
>
> libffi handles this using user level dynamic code which needs to be executed.
> If the security subsystem prevents that, then the dynamic code cannot execute.
> That is the whole point of this RFC.

/If/ you are using a trampoline created by the kernel, it can just
create the one that libffi is using today; which doesn't need trapping
into the kernel at execution time.

And if you aren't, you can use the trampoline I wrote, which has no
dynamic code, and doesn't need to trap into the kernel at execution time
either.

>
> >>
> >> Permitting the use of trampfd
> >> -----------------------------
> >>
> >> An "exectramp" setting can be implemented in SELinux to selectively allow the
> >> use of trampfd for applications.
> >>
> >> Madhavan
> >
> > Applications can use their own userspace trampolines regardless of this
> > setting, so it doesn't provide any additional security benefit by
> > preventing usage of trampfd.
> >
>
> The background for all of this is that dynamic code such as trampolines
> need to be placed in a page with executable permissions so they can
> execute. If security measures such as W^X are present, this will not
> be possible. Admitted, today some user level tricks exist to get around
> W^X. I have alluded to those. IMO, they are all security holes and will
> get plugged sooner or later. Then, these trampolines cannot execute.
> Currently, there exist security exceptions such as execmem to let them
> execute. But we would like to do it without making security exceptions.
>
> Madhavan

How can you still say this after this whole discussion? Applications can
get the exact same functionality as your proposed trampfd using static
code, no W^X tricks needed.

This only matters if you have a trampfd that generates _truly_ dynamic
code, not just code that can be trivially made static.
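
The read-only-after-initialization idea raised in this message (a pointer table
that is filled in once and then mapped read-only) can be sketched with mmap()
and mprotect(). This is a minimal illustration assuming Linux and a table that
fits in one page; make_readonly_table() and handler_double() are hypothetical
names, not taken from libffi or the dynamic linker.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

typedef int (*handler_fn)(int);

/* Hypothetical handler used only to have something to point at. */
static int handler_double(int x) { return 2 * x; }

/* Allocate one anonymous page, copy n handler pointers into it, then
 * drop write permission. Returns NULL on failure. */
static handler_fn *make_readonly_table(const handler_fn *src, size_t n)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || n > (size_t)page / sizeof(*src))
        return NULL;

    handler_fn *tbl = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tbl == MAP_FAILED)
        return NULL;

    for (size_t i = 0; i < n; i++)
        tbl[i] = src[i];

    /* From here on, a stray or malicious store to the table faults. */
    if (mprotect(tbl, (size_t)page, PROT_READ) != 0) {
        munmap(tbl, (size_t)page);
        return NULL;
    }
    return tbl;
}
```

The table stays readable and callable after the mprotect() call. Note that this
does nothing about the window during which the page is still writable.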
On 9/26/20 10:55 AM, Arvind Sankar wrote:
> On Fri, Sep 25, 2020 at 05:44:56PM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 9/24/20 6:43 PM, Arvind Sankar wrote:
>>>
>>> The source PC will generally not be available if the compiler decided to
>>> tail-call optimize the call to the trampoline into a jump.
>>>
>>
>> This is still work in progress. But I am thinking that labels can be used.
>> So, if the code is:
>>
>>     invoke_tramp:
>>         (*tramp)();
>>
>> then, invoke_tramp can be supplied as the calling PC.
>>
>> Similarly, labels can be used in assembly functions as well.
>>
>> Like I said, I have to think about this more.
>
> What I mean is that the kernel won't have access to the actual source
> PC. If I followed your v1 correctly, it works by making any branch to
> the trampoline code trigger a page fault. At this point, the PC has
> already been updated to the trampoline entry, so the only thing the
> fault handler can know is the return address on the top of the stack,
> which (a) might not be where the branch actually originated, either
> because it was a jump, or you've already been hacked and you got here
> using a ret; (b) is available to userspace anyway.

Like I said, this is work in progress. I have to spend time to figure out
how this would work or if this would work. So, let us brainstorm this
a little bit.

There are two ways to invoke the trampoline:

(1) By just branching to the trampoline address.

(2) Or, by treating the address as a function pointer and calling it.
    In the libffi case, it is (2).

If it is (2), it is easier. We can figure out the return address of the
call which would be the location after the call instruction.

If it is (1), it is harder as you point out. So, we can support this
at least for (2). The user can inform trampfd as to the type of
invocation for the trampoline.

For (1), the return address would be that of the call to the function
that contains the branch. If the kernel can get that call instruction
and figure out the function address, then we can do something.

I admit this is a bit hairy at the moment. I have to work it out.

>
>>
>>> What's special about these trampolines anyway? Any indirect function
>>> call could have these same problems -- an attacker could have
>>> overwritten the pointer the same way, whether it's supposed to point to
>>> a normal function or it is the target of this trampoline.
>>>
>>> For making them a bit safer, userspace could just map the page holding
>>> the data pointers/destination address(es) as read-only after
>>> initialization.
>>>
>>
>> You need to look at version 1 of trampfd for how to do "allowed pcs".
>> As an example, libffi defines ABI handlers for every arch-ABI combo.
>> These ABI handler pointers could be placed in an array in .rodata.
>> Then, the array can be written into trampfd for setting allowed PCS.
>> When the target PC is set for a trampoline, the kernel will check
>> it against allowed PCs and reject it if it has been overwritten.
>
> I'm not asking how it's implemented. I'm asking what's the point? On a
> typical linux system, at least on x86, every library function call is an
> indirect branch. The protection they get is that the dynamic linker can
> map the pointer table read-only after initializing it.
>

The security subsystem is concerned about dynamic code, not the indirect
branches set up for dynamic linking.

> For the RO mapping, libffi could be mapping both the entire closure
> structure, as well as the structure that describes the arguments and
> return types of the function, read-only once they are initialized.
>

This has been suggested in some form before. The general problem with
this approach is that when the page is still writable, an attacker
can inject his code potentially. Making the page read-only after the
fact may not help. In specific use cases, it may work. But it is
not OK as a general approach to solving this problem.

> For libffi, there are three indirect branches for every trampoline call
> with your suggested trampoline: one to get to the trampoline, one to
> jump to the handler, and one to call the actual user function. If we are
> particularly concerned about the trampoline to handler branch for some
> reason, we could just replace it with a direct branch: if the kernel was
> generating the code, there's no reason to allow the data pointer or code
> target to be changed after the trampoline was created. It can just
> hard-code them in the generated code and be done with it. Even with
> user-space trampolines, you can use a direct call. All you need is
> libffi-trampoline.so which contains a few thousand trampolines all
> jumping to one handler, which then decides what to do based on which
> trampoline was called. Sure libffi currently dispatches to one of 2-3
> handlers based on the ABI, but there's no technical reason it couldn't
> dispatch to just one that handled all the ABIs, and the trampoline could
> be boiled down to just:
>     endbr
>     call handler
>     ret
>

One still needs this trampoline:

    load closure in some register
    jump to single_handler

In the kernel based solution, the user would specify to the kernel
the target PC in a code context.

    pwrite(trampfd, code_context, size, CODE_OFFSET);

code_context itself can be hacked unless it is in .rodata. The
allowed_pcs thing exists for apps/libs that are unable or
unwilling to place code_context in .rodata.

I would like to not just focus on how to solve things for libffi alone.

>>>>
>>>> In order to address the FFI_REGISTER ABI in libffi, we could use the secure
>>>> trampoline. In FFI_REGISTER, the data is pushed on the stack and the code
>>>> is jumped to without using any registers.
>>>>
>>>> As outlined in version 1, the kernel can push the data address on the stack
>>>> and write the code address into the PC and return to userland.
>>>>
>>>> For doing all of this, we need trampfd.
>>>
>>> We don't need this for FFI_REGISTER. I presented a solution that works
>>> in userspace. Even if you want to use a trampoline created by the
>>> kernel, there's no reason it needs to trap into the kernel at trampoline
>>> execution time. libffi's trampolines already handle this case today.
>>>
>>
>> libffi handles this using user level dynamic code which needs to be executed.
>> If the security subsystem prevents that, then the dynamic code cannot execute.
>> That is the whole point of this RFC.
>
> /If/ you are using a trampoline created by the kernel, it can just
> create the one that libffi is using today; which doesn't need trapping
> into the kernel at execution time.
>
> And if you aren't, you can use the trampoline I wrote, which has no
> dynamic code, and doesn't need to trap into the kernel at execution time
> either.
>

The kernel based solution gives you the opportunity to make additional
security checks at the time a trampoline is invoked. A purely user level
solution cannot do that. E.g., I would like to prevent even the minimal
trampoline from being used in BOP/ROP chains.

>>
>>>>
>>>> Permitting the use of trampfd
>>>> -----------------------------
>>>>
>>>> An "exectramp" setting can be implemented in SELinux to selectively allow the
>>>> use of trampfd for applications.
>>>>
>>>> Madhavan
>>>
>>> Applications can use their own userspace trampolines regardless of this
>>> setting, so it doesn't provide any additional security benefit by
>>> preventing usage of trampfd.
>>>
>>
>> The background for all of this is that dynamic code such as trampolines
>> need to be placed in a page with executable permissions so they can
>> execute. If security measures such as W^X are present, this will not
>> be possible. Admitted, today some user level tricks exist to get around
>> W^X. I have alluded to those. IMO, they are all security holes and will
>> get plugged sooner or later. Then, these trampolines cannot execute.
>> Currently, there exist security exceptions such as execmem to let them
>> execute. But we would like to do it without making security exceptions.
>>
>> Madhavan
>
> How can you still say this after this whole discussion? Applications can
> get the exact same functionality as your proposed trampfd using static
> code, no W^X tricks needed.
>
> This only matters if you have a trampfd that generates _truly_ dynamic
> code, not just code that can be trivially made static.
>

How can *you* still say this after all this discussion? I have already
explained all of this. The trivial bootstrap trampoline can be provided
in a user library as well as the kernel. The user land solution provides
a fast trampoline that does the job. The kernel solution is slower but
allows for additional security checks that a user land solution does
not allow. IMO, it should be a choice what type of trampoline the user
wants.

And this is not just for libffi that we can somehow do this within libffi.
I would like to provide something so that the maintainers of other
dynamic code can use it to convert their dynamic code to static code
when their dynamic code is a lot more complex than the libffi trampoline.

I am already willing to implement a user land only solution. I don't
see the problem.

Madhavan
Before I implement the user land solution recommended by reviewers, I just want
an opinion on where the code should reside.

I am thinking glibc. The other choice would be a separate library, say, libtramp.
What do you recommend?

Madhavan
> And this is not just for libffi that we can somehow do this within libffi.
> I would like to provide something so that the maintainers of other
> dynamic code can use it to convert their dynamic code to static code
> when their dynamic code is a lot more complex than the libffi trampoline.

Having worked on stuff "like" this -- removing "arbitrary" codegen from
a system and replacing with "templatized" codegen, because the runtime
banned runtime codegen, and despite being a lover of shared source and
shared libraries, I'm afraid to say, this is not an area very amenable
to sharing.

Specifically I've done this twice.

Providing examples is good and people will copy/paste.

The problem can be sort of split up into parts:

 - The management of a pool of thunks.
 - The thunks.

Where I mostly give up is:

 - Generalizing the thunks, such as to share them.

The management of the pool is kinda sorta generalizable, but the thunks,
again, it is difficult/impossible to share.

I do think, there is *some* opportunity here.

Stuff like, for some function f(x,y), produce a new function f2(y,x)
that swaps params and calls f, or a new function f3(x) that sets y to a
constant and calls f.

Like my favorite Scheme-ish:

Given a static binary function:

    function add(x, y) (+ x y)

Provide for dynamically creating specialized unary functions:

    function make-add(x):
        return function addx(y) (+ x y)

And then generalized to arbitrary rearrangement and hardcoding of
parameters. I believe this is libffi, and might be able to replace some
people's codegens.

It sounds a bit contrived, but I know this actually resembles real world
cases. Consider some library that accepts function pointers but fails to
accept an additional void* to pass on to them. qsort/bsearch are the
classic broken-ish cases.

Wrapping Windows WNDPROCs in C++ is another -- you want a "thunk" to
take the Win32-defined parameters, and add a this pointer as well. So
you create a new function and when you create the function you give it
the this pointer to hardcode within it. i.e. atlthunk.

> The kernel based solution gives you the opportunity to make additional
> security checks at the time a trampoline is invoked. A purely user level
> solution cannot do that. E.g., I would like to prevent even the minimal
> trampoline from being used in BOP/ROP chains.

Like what?

At some point, it is just normal static code. Once libffi is fixed, so
that the iOS solution is available on all platforms, it is all just
normal code. There are no checks to apply differently to libffi and its
output than any other code, right?

> Before I implement the user land solution recommended by reviewers, I just want
> an opinion on where the code should reside.
>
> I am thinking glibc. The other choice would be a separate library, say, libtramp.
> What do you recommend?

What functionality does the user land solution provide?

I suggest, other than lobbying the libffi developers to do their part,
and perhaps giving in and doing it yourself, identify some other dynamic
but non-arbitrary code generations that you wish to fix and work through
fixing it. See what patterns emerge.

 - Jay
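
The make-add idea in the message above (a pool of thunks plus the thunks
themselves) can be sketched in plain C without any runtime code generation.
This is only an illustration under stated assumptions: the pool is fixed at
four entries, and make_add(), free_add() and the thunk_N functions are
hypothetical names rather than part of libffi or any other library. Each thunk
is ordinary static code that knows its own slot index, and the per-thunk data
lives in a separate writable table.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*unary_fn)(int);

#define POOL_SIZE 4

static int bound_x[POOL_SIZE];  /* per-thunk data: the hardcoded x */
static int in_use[POOL_SIZE];   /* pool management state */

/* The static binary function being specialized. */
static int add(int x, int y) { return x + y; }

/* The thunks: static code, one function per pool slot. */
#define DEFINE_THUNK(n) \
    static int thunk_##n(int y) { return add(bound_x[n], y); }
DEFINE_THUNK(0)
DEFINE_THUNK(1)
DEFINE_THUNK(2)
DEFINE_THUNK(3)

static const unary_fn thunks[POOL_SIZE] = {
    thunk_0, thunk_1, thunk_2, thunk_3
};

/* Pool management: hand out a free thunk bound to x, or NULL when the
 * pool is exhausted. */
static unary_fn make_add(int x)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            bound_x[i] = x;
            return thunks[i];
        }
    }
    return NULL;
}

/* Return a thunk to the pool so its slot can be rebound later. */
static void free_add(unary_fn f)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        if (thunks[i] == f)
            in_use[i] = 0;
}
```

The pool bookkeeping (in_use, make_add, free_add) is the part that generalizes;
the thunk body is tied to the signature of add(), which matches the point made
above that the thunks themselves are hard to share.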
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduction
============

Dynamic code is used in many different user applications. Dynamic code is
often generated at runtime. Dynamic code can also just be a pre-defined
sequence of machine instructions in a data buffer. Examples of dynamic
code are trampolines, JIT code, DBT code, etc.

Dynamic code is placed either in a data page or in a stack page. In order
to execute dynamic code, the page it resides in needs to be mapped with
execute permissions. Writable pages with execute permissions provide an
attack surface for hackers. Attackers can use this to inject malicious
code, modify existing code or do other harm.

To mitigate this, LSMs such as SELinux implement W^X. That is, they may not
allow pages to have both write and execute permissions. This prevents
dynamic code from executing and blocks applications that use it. To allow
genuine applications to run, exceptions have to be made for them (by setting
execmem, etc) which opens the door to security issues.

The W^X implementation today is not complete. There exist many user level
tricks that can be used to load and execute dynamic code. E.g.,

- Load the code into a file and map the file with R-X.

- Load the code in an RW- page. Change the permissions to R--. Then,
  change the permissions to R-X.

- Load the code in an RW- page. Remap the page with R-X to get a separate
  mapping to the same underlying physical page.

IMO, these are all security holes as an attacker can exploit them to inject
his own code.

In the future, these holes will definitely be closed. For instance, LSMs
(such as the IPE proposal [1]) may only allow code in properly signed object
files to be mapped with execute permissions. This will do two things:

    - user level tricks using anonymous pages will fail as anonymous
      pages have no file identity

    - loading the code in a temporary file and mapping it with R-X
      will fail as the temporary file would not have a signature

We need a way to execute such code without making security exceptions.
Trampolines are a good example of dynamic code. A couple of examples
of trampolines are given below. My first use case for this RFC is
libffi.

Examples of trampolines
=======================

libffi (A Portable Foreign Function Interface Library):

libffi allows a user to define functions with an arbitrary list of
arguments and return value through a feature called "Closures".
Closures use trampolines to jump to ABI handlers that handle calling
conventions and call a target function. libffi is used by a lot
of different applications. To name a few:

    - Python
    - Java
    - Javascript
    - Ruby FFI
    - Lisp
    - Objective C

GCC nested functions:

GCC has traditionally used trampolines for implementing nested
functions. The trampoline is placed on the user stack. So, the stack
needs to be executable.

Currently available solution
============================

One solution that has been proposed to allow trampolines to be executed
without making security exceptions is Trampoline Emulation. See:

https://pax.grsecurity.net/docs/emutramp.txt

In this solution, the kernel recognizes certain sequences of instructions
as "well-known" trampolines. When such a trampoline is executed, a page
fault happens because the trampoline page does not have execute permission.
The kernel recognizes the trampoline and emulates it. Basically, the
kernel does the work of the trampoline on behalf of the application.

Currently, the emulated trampolines are the ones used in libffi and GCC
nested functions. To my knowledge, only X86 is supported at this time.

As noted in emutramp.txt, this is not a generic solution.
For every new trampoline that needs to be supported, new instruction
sequences need to be recognized by the kernel and emulated. And this has
to be done for every architecture that needs to be supported.

emutramp.txt notes the following:

"... the real solution is not in emulation but by designing a kernel API
for runtime code generation and modifying userland to make use of it."

Solution proposed in this RFC
=============================

From this RFC's perspective, there are two scenarios for dynamic code:

Scenario 1
----------

We know what code we need only at runtime. For instance, JIT code generated
for frequently executed Java methods. Only at runtime do we know what
methods need to be JIT compiled. Such code cannot be statically defined. It
has to be generated at runtime.

Scenario 2
----------

We know what code we need in advance. User trampolines are a good example of
this. It is possible to define such code statically with some help from the
kernel.

This RFC addresses (2). (1) needs a general purpose trusted code generator
and is out of scope for this RFC.

For (2), the solution is to convert dynamic code to static code and place it
in a source file. The binary generated from the source can be signed. The
kernel can use signature verification to authenticate the binary and
allow the code to be mapped and executed.

The problem is that the static code has to be able to find the data that it
needs when it executes. For functions, the ABI defines the way to pass
parameters. But, for arbitrary dynamic code, there isn't a standard ABI
compliant way to pass data to the code for most architectures. Each instance
of dynamic code defines its own way. For instance, co-location of code and
data and PC-relative data referencing are used in cases where the ISA
supports it.

We need one standard way that would work for all architectures and ABIs.

The solution proposed here is:

1. Write the static code assuming that the data needed by the code is already
   pointed to by a designated register.

2. Get the kernel to supply a small universal trampoline that does the
   following:

    - Load the address of the data in a designated register
    - Load the address of the static code in a designated register
    - Jump to the static code

User code would use a kernel supplied API to create and map the trampoline.
The address values would be baked into the code so that no special ISA
features are needed.

To conserve memory, the kernel will pack as many trampolines as possible in
a page and provide a trampoline table to user code. The table itself is
managed by the user.

Trampoline File Descriptor (trampfd)
====================================

I am proposing a kernel API using anonymous file descriptors that can be
used to create the trampolines. The API is described in patch 1/4 of this
patchset. I provide a summary here:

    - Create a trampoline file object

    - Write a code descriptor into the trampoline file and specify:

        - the number of trampolines desired
        - the name of the code register
        - user pointer to a table of code addresses, one address
          per trampoline

    - Write a data descriptor into the trampoline file and specify:

        - the name of the data register
        - user pointer to a table of data addresses, one address
          per trampoline

    - mmap() the trampoline file. The kernel generates a table of
      trampolines in a page and returns the trampoline table address

    - munmap() a trampoline file mapping

    - Close the trampoline file

Each mmap() will only map a single base page. Large pages are not supported.

A trampoline file can only be mapped once in an address space.

Trampoline file mappings cannot be shared across address spaces. So,
sending the trampoline file descriptor over a unix domain socket and
mapping it in another process will not work.

It is recommended that the code descriptor and the code table be placed
in the .rodata section so an attacker cannot modify them.
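
The behaviour of the small universal trampoline summarized above (load a data
address, load a code address, jump) can be modeled in portable C. This is only
a model under stated assumptions, not the kernel-generated trampoline: the data
pointer is passed as an ordinary first argument rather than in a designated
register, the table is fixed at two entries, and tramp_0, tramp_1,
scale_handler and offset_handler are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NTRAMP 2

/* "Static code" that expects its data via a pointer. */
typedef int (*code_fn)(void *data, int arg);

/* One code address and one data address per trampoline. */
static code_fn code_table[NTRAMP];
static void   *data_table[NTRAMP];

/* Trampoline x: look up its own table entries, then transfer control.
 * The index x is fixed when the trampoline is defined, so no
 * PC-relative addressing is involved. */
#define DEFINE_TRAMP(x) \
    static int tramp_##x(int arg) \
    { return code_table[x](data_table[x], arg); }
DEFINE_TRAMP(0)
DEFINE_TRAMP(1)

/* Two example handlers; data points to an int in both cases. */
static int scale_handler(void *data, int arg)  { return *(int *)data * arg; }
static int offset_handler(void *data, int arg) { return *(int *)data + arg; }
```

In this model a trampoline is retargeted by storing a new pair into
code_table[x] and data_table[x]; the trampoline code itself never changes.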
Trampoline use and reuse
========================

The code for trampoline X in the trampoline table is:

    load    &code_table[X], code_reg
    load    (code_reg), code_reg
    load    &data_table[X], data_reg
    load    (data_reg), data_reg
    jump    code_reg

The addresses &code_table[X] and &data_table[X] are baked into the
trampoline code. So, PC-relative data references are not needed. The user
can modify code_table[X] and data_table[X] dynamically.

For instance, within libffi, the same trampoline X can be used for different
closures at different times by setting:

    data_table[X] = closure;
    code_table[X] = ABI handling code;

Advantages of the Trampoline File Descriptor approach
=====================================================

- Using this support from the kernel, dynamic code can be converted to
  static code with a little effort so applications and libraries can move to
  a more secure model. In the simplest cases such as libffi, dynamic code can
  even be eliminated.

- This initial work is targeted towards X86 and ARM. But it can be supported
  easily on all architectures. We don't need any special ISA features such
  as PC-relative data referencing.

- The only code generation needed is for this small, universal trampoline.

- The kernel does not have to deal with any ABI issues in the generation of
  this trampoline.

- The kernel provides a trampoline table to conserve memory.

- An SELinux setting called "exectramp" can be implemented along the
  lines of "execmem", "execstack" and "execheap" to selectively allow the
  use of trampolines on a per application basis.

- In version 1, a trip to the kernel was required to execute the trampoline.
  In version 2, that is not required. So, there are no performance
  concerns in this approach.

libffi
======

I have implemented my solution for libffi and provided the changes for
X86 and ARM, 32-bit and 64-bit. Here is the reference patch:

http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt

If the trampfd patchset gets accepted, I will send the libffi changes
to the maintainers for a review. BTW, I have also successfully executed
the libffi self tests.

Work that is pending
====================

- I am working on implementing the SELinux setting - "exectramp".

- I have a test program to test the kernel API. I am working on adding it
  to selftests.

References
==========

[1] https://microsoft.github.io/ipe/

---

Changelog:

v1
    Introduced the Trampfd feature.

v2
    - Changed the system call. Version 2 does not support different
      trampoline types and their associated type structures. It only
      supports a kernel generated trampoline.

      The system call now returns information to the user that is
      used to define trampoline descriptors. E.g., the maximum
      number of trampolines that can be packed in a single page.

    - Removed all the trampoline contexts such as register contexts
      and stack contexts. This is based on the feedback that the kernel
      should not have to worry about ABI issues and H/W features that
      may deal with the context of a process.

    - Removed the need to make a trip into the kernel on trampoline
      invocation. This is based on the feedback about performance.

    - Removed the ability to share trampolines across address spaces.
      This would have made sense to different trampoline types based
      on their semantics. But since I support only one specific
      trampoline, sharing does not make sense.

    - Added calls to specify trampoline descriptors that the kernel
      uses to generate trampolines.

    - Added architecture-specific code to generate the small, universal
      trampoline for X86 32 and 64-bit, ARM 32 and 64-bit.

    - Implemented the trampoline table in a page.

Madhavan T. Venkataraman (4):
  Implement the kernel API for the trampoline file descriptor.
  Implement i386 and X86 support for the trampoline file descriptor.
  Implement ARM64 support for the trampoline file descriptor.
  Implement ARM support for the trampoline file descriptor.

 arch/arm/include/uapi/asm/ptrace.h     |  21 +++
 arch/arm/kernel/Makefile               |   1 +
 arch/arm/kernel/trampfd.c              | 124 +++++++++++++
 arch/arm/tools/syscall.tbl             |   1 +
 arch/arm64/include/asm/unistd.h        |   2 +-
 arch/arm64/include/asm/unistd32.h      |   2 +
 arch/arm64/include/uapi/asm/ptrace.h   |  59 ++++++
 arch/arm64/kernel/Makefile             |   2 +
 arch/arm64/kernel/trampfd.c            | 244 +++++++++++++++++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/uapi/asm/ptrace.h     |  38 ++++
 arch/x86/kernel/Makefile               |   1 +
 arch/x86/kernel/trampfd.c              | 238 ++++++++++++++++++++++++
 fs/Makefile                            |   1 +
 fs/trampfd/Makefile                    |   5 +
 fs/trampfd/trampfd_fops.c              | 241 ++++++++++++++++++++++++
 fs/trampfd/trampfd_map.c               | 142 ++++++++++++++
 include/linux/syscalls.h               |   2 +
 include/linux/trampfd.h                |  49 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/trampfd.h           | 184 +++++++++++++++++++
 init/Kconfig                           |   7 +
 kernel/sys_ni.c                        |   3 +
 24 files changed, 1371 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm/kernel/trampfd.c
 create mode 100644 arch/arm64/kernel/trampfd.c
 create mode 100644 arch/x86/kernel/trampfd.c
 create mode 100644 fs/trampfd/Makefile
 create mode 100644 fs/trampfd/trampfd_fops.c
 create mode 100644 fs/trampfd/trampfd_map.c
 create mode 100644 include/linux/trampfd.h
 create mode 100644 include/uapi/linux/trampfd.h