Message ID | 20230227222957.24501-2-rick.p.edgecombe@intel.com (mailing list archive) |
---|---|
State | New |
Series | Shadow stacks for userspace |
The 02/27/2023 14:29, Rick Edgecombe wrote:
> +Application Enabling
> +====================
> +
> +An application's CET capability is marked in its ELF note and can be verified
> +from readelf/llvm-readelf output::
> +
> +    readelf -n <application> | grep -a SHSTK
> +        properties: x86 feature: SHSTK
> +
> +The kernel does not process these applications markers directly. Applications
> +or loaders must enable CET features using the interface described in section 4.
> +Typically this would be done in dynamic loader or static runtime objects, as is
> +the case in GLIBC.

Note that this has to be an early decision in libc (ld.so or static
exe start code), which will be difficult to hook into system wide
security policy settings. (e.g. to force shstk on marked binaries.)

From userspace POV I'd prefer if a static exe did not have to parse
its own ELF notes (i.e. kernel enabled shstk based on the marking).

But I realize if there is a need for complex shstk enable/disable
decision that is better in userspace and if the kernel decision can
be overridden then it might as well all be in userspace.

> +Enabling arch_prctl()'s
> +=======================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's. They
> +are only supported in 64 bit user applications.
> +
> +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> +    Enable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> +    Disable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> +    Lock in features at their current enabled or disabled status. 'features'
> +    is a mask of all features to lock. All bits set are processed, unset bits
> +    are ignored. The mask is ORed with the existing value. So any feature bits
> +    set here cannot be enabled or disabled afterwards.

The multi-thread behaviour should be documented here: Only the
current thread is affected. So an application can only change the
setting while single-threaded, which is only guaranteed before any
user code is executed. Using the prctl later is complicated and
most C runtimes would not want to do that (async signalling all
threads and prctl from the handler).

In particular these interfaces are not suitable to turn shstk off
at dlopen time when an unmarked binary is loaded. Nor will any other
late shstk policy change work, so as far as I can see
the "permissive" mode in glibc does not work.

Does the main thread have shadow stack allocated before shstk is
enabled? Is the shadow stack freed when it is disabled? (e.g.
what would the instruction reading the SSP do in disabled state?)

> +Proc Status
> +===========
> +To check if an application is actually running with shadow stack, the
> +user can read the /proc/$PID/status. It will report "wrss" or "shstk"
> +depending on what is enabled. The lines look like this::
> +
> +    x86_Thread_features: shstk wrss
> +    x86_Thread_features_locked: shstk wrss

Presumably /proc/$TID/status and /proc/$PID/task/$TID/status also
show the setting, and it is only valid for the specific thread (not the
entire process). So I would note that this is for one thread only.

> +Implementation of the Shadow Stack
> +==================================
> +
> +Shadow Stack Size
> +-----------------
> +
> +A task's shadow stack is allocated from memory to a fixed size of
> +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
> +the maximum size of the normal stack, but capped to 4 GB. However,
> +a compat-mode application's address space is smaller, each of its thread's
> +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).

This policy tries to handle all threads with the same shadow stack
size logic, which has limitations.
I think it should be improved
(otherwise some applications will have to turn shstk off):

- RLIMIT_STACK is not an upper bound for the main thread stack size
  (rlimit can increase/decrease dynamically).
- RLIMIT_STACK only applies to the main thread, so it is not an upper
  bound for non-main thread stacks.
- i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
  stack can overflow.
- stack size << startup RLIMIT_STACK is also possible and then VA
  space is wasted (can lead to OOM with strict memory overcommit).
- clone3 tells the kernel the thread stack size so that should be
  used instead of RLIMIT_STACK. (clone does not though.)
- I think it's better to have a new limit specifically for shadow
  stack size (which by default can be RLIMIT_STACK) so userspace
  can adjust it if needed (another reason is that stack size is
  not always a good indicator of max call depth).

> +Signal
> +------
> +
> +By default, the main program and its signal handlers use the same shadow
> +stack. Because the shadow stack stores only return addresses, a large
> +shadow stack covers the condition that both the program stack and the
> +signal alternate stack run out.

What does "by default" mean here? Is there a case when the signal handler
is not entered with SSP set to the handling thread's shadow stack?

> +When a signal happens, the old pre-signal state is pushed on the stack. When
> +shadow stack is enabled, the shadow stack specific state is pushed onto the
> +shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
> +in a special format with bit 63 set. On sigreturn this old SSP token is
> +verified and restored by the kernel. The kernel will also push the normal
> +restorer address to the shadow stack to help userspace avoid a shadow stack
> +violation on the sigreturn path that goes through the restorer.

The kernel pushes on the shadow stack on signal entry so shadow stack
overflow cannot be handled. Please document this as a non-recoverable
failure.

I think it can be made recoverable if signals with alternate stack run
on a different shadow stack. And the top of the thread shadow stack is
just corrupted instead of pushed in the overflow case. Then longjmp out
can be made to work (common in stack overflow handling cases), and
reliable crash report from the signal handler works (also common).

Does SSP get stored into the sigcontext struct somewhere?

> +Fork
> +----
> +
> +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
> +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
> +shadow access triggers a page fault with the shadow stack access bit set
> +in the page fault error code.
> +
> +When a task forks a child, its shadow stack PTEs are copied and both the
> +parent's and the child's shadow stack PTEs are cleared of the dirty bit.
> +Upon the next shadow stack access, the resulting shadow stack page fault
> +is handled by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new shadow stack
> +for the new thread. New shadow stack's behave like mmap() with respect to
> +ASLR behavior.

Please document the shadow stack lifetimes here:

I think thread exit unmaps shadow stack and vfork shares shadow stack
with parent so exit does not unmap.

I think the map_shadow_stack syscall should be mentioned in this
document too.

ABI for initial shadow stack entries:

If one wants to scan the shadow stack how to detect the end (e.g. fast
backtrace)? Is it useful to put an invalid value (-1) there?
(affects map_shadow_stack syscall too).

thanks.
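The enable/disable/lock rules in the quoted arch_prctl() documentation can be sketched as a small userspace model. This is not the kernel implementation: the struct, the function names, the bit values and the plain -1 error return are all assumptions made for illustration. It only captures what the quoted text states, namely that enable and disable take exactly one feature, that the lock mask is ORed into the existing one, and that locked features can no longer change. The state is per thread, matching the review comment that only the current thread is affected.

```c
#include <stdint.h>

/* Feature bits. The names echo the quoted documentation; the values are assumed. */
#define MODEL_FEATURE_SHSTK (1UL << 0)
#define MODEL_FEATURE_WRSS  (1UL << 1)

/* Hypothetical per-thread state (a model, not a kernel structure). */
struct thread_features {
    unsigned long enabled; /* features currently on for this thread */
    unsigned long locked;  /* features whose state can no longer change */
};

/* True when exactly one bit is set in f. */
static int is_single_feature(unsigned long f)
{
    return f != 0 && (f & (f - 1)) == 0;
}

/* Models ARCH_SHSTK_ENABLE: one feature at a time, refused if locked. */
int model_enable(struct thread_features *t, unsigned long feature)
{
    if (!is_single_feature(feature) || (t->locked & feature))
        return -1;
    t->enabled |= feature;
    return 0;
}

/* Models ARCH_SHSTK_DISABLE: one feature at a time, refused if locked. */
int model_disable(struct thread_features *t, unsigned long feature)
{
    if (!is_single_feature(feature) || (t->locked & feature))
        return -1;
    t->enabled &= ~feature;
    return 0;
}

/* Models ARCH_SHSTK_LOCK: the mask is ORed with the existing lock mask. */
void model_lock(struct thread_features *t, unsigned long features)
{
    t->locked |= features;
}
```

Because each thread would carry its own `struct thread_features` in this model, changing the policy after other threads exist means touching every thread individually, which is the difficulty raised in the review.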
On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > +Application Enabling
> > +====================
> > +
> > +An application's CET capability is marked in its ELF note and can be verified
> > +from readelf/llvm-readelf output::
> > +
> > +    readelf -n <application> | grep -a SHSTK
> > +        properties: x86 feature: SHSTK
> > +
> > +The kernel does not process these applications markers directly. Applications
> > +or loaders must enable CET features using the interface described in section 4.
> > +Typically this would be done in dynamic loader or static runtime objects, as is
> > +the case in GLIBC.
>
> Note that this has to be an early decision in libc (ld.so or static
> exe start code), which will be difficult to hook into system wide
> security policy settings. (e.g. to force shstk on marked binaries.)

In the eager enabling (by the kernel) scenario, how is this improved?
The loader has to have the option to disable the shadow stack if
enabling conditions are not met, so it still has to trust userspace to
not do that. Did you have any more specifics on how the policy would
work?

> From userspace POV I'd prefer if a static exe did not have to parse
> its own ELF notes (i.e. kernel enabled shstk based on the marking).

This is actually exactly what happens in the glibc patches. My
understanding was that it had already been discussed amongst glibc folks.

> But I realize if there is a need for complex shstk enable/disable
> decision that is better in userspace and if the kernel decision can
> be overridden then it might as well all be in userspace.

A complication with shadow stack in general is that it has to be
enabled very early. Otherwise when the program returns from main(), it
will get a shadow stack underflow. The old logic in this series would
enable shadow stack if the loader had the SHSTK bit (by parsing the
header in the kernel).
Then later if the conditions were not met to use shadow stack, the loader would call into the kernel again to disable shadow stack. One problem (there were several with this area) with this eager enabling, was the kernel ended up mapping, briefly using, and then unmapping the shadow stack in the case of a executable not supporting shadow stack. What the glibc patches do today is pretty much the same behavior as before, just with the header parsing moved into userspace. I think letting the component with the most information make the decision leaves open the best opportunity for making it efficient. I wonder if it could be possible for glibc to enable it later than it currently does in the patches and improve the dynamic loader case, but I don't know enough of that code. > > > +Enabling arch_prctl()'s > > +======================= > > + > > +Elf features should be enabled by the loader using the below > > arch_prctl's. They > > +are only supported in 64 bit user applications. > > + > > +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature) > > + Enable a single feature specified in 'feature'. Can only > > operate on > > + one feature at a time. > > + > > +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature) > > + Disable a single feature specified in 'feature'. Can only > > operate on > > + one feature at a time. > > + > > +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features) > > + Lock in features at their current enabled or disabled status. > > 'features' > > + is a mask of all features to lock. All bits set are processed, > > unset bits > > + are ignored. The mask is ORed with the existing value. So any > > feature bits > > + set here cannot be enabled or disabled afterwards. > > The multi-thread behaviour should be documented here: Only the > current thread is affected. So an application can only change the > setting while single-threaded which is only guaranteed before any > user code is executed. 
> Later using the prctl is complicated and
> most c runtimes would not want to do that (async signalling all
> threads and prctl from the handler).

It is kind of covered in the fork() docs, but yes there should probably
be a reference here too.

> In particular these interfaces are not suitable to turn shstk off
> at dlopen time when an unmarked binary is loaded. Or any other
> late shstk policy change will not work, so as far as i can see
> the "permissive" mode in glibc does not work.

Yes, that is correct. Glibc permissive mode does not fully work. There
are some ongoing discussions on how to make it work. Some options don't
require kernel changes, and some do.

Making it per-thread is complicated for x86 because when shadow stack
is off, some of the special shadow stack instructions will cause a #UD
exception. Glibc (and probably other apps in the future) could be in the
middle of executing these instructions when dlopen() was called. So if
there was a process wide disable option it would have to be resilient
to these #UDs. And even then the code that used them could not be
guaranteed to continue to work. For example, if you call the gcc
intrinsic _get_ssp() when shadow stack is enabled it could be expected
to point to the shadow stack in most cases. If shadow stack gets
disabled, rdssp will return 0, in which case reading the shadow stack
would segfault. So the all-process disabling solution can't be fully
robust when there is any shadow stack specific logic.

The other option discussed was creating trampolines between the linked
legacy objects that could know to tell the kernel to disable shadow
stack if needed. In this case, shadow stack is disabled for each thread
as it calls into the DSO. It's not clear if there can be enough
information gleaned from the legacy binaries to know when to generate
the trampolines in exotic cases.
A third option might be to have some synchronization between the kernel and userspace around anything using the shadow stack instructions. But there is not much detail filled in there. So in summary, it's not as simple as making the disable per-process. > > Does the main thread have shadow stack allocated before shstk is > enabled? No. > is the shadow stack freed when it is disabled? (e.g. > what would the instruction reading the SSP do in disabled state?) Yes. When shadow stack is disabled rdssp is a NOP, the intrinsic returns NULL. > > > +Proc Status > > +=========== > > +To check if an application is actually running with shadow stack, > > the > > +user can read the /proc/$PID/status. It will report "wrss" or > > "shstk" > > +depending on what is enabled. The lines look like this:: > > + > > + x86_Thread_features: shstk wrss > > + x86_Thread_features_locked: shstk wrss > > Presumaly /proc/$TID/status and /proc/$PID/task/$TID/status also > shows the setting and only valid for the specific thread (not the > entire process). So i would note that this for one thread only. Since enabling/disabling is per-thread, and the field is called "x86_Thread_features" I thought it was clear. It's easy to add some more detail though. > > > +Implementation of the Shadow Stack > > +================================== > > + > > +Shadow Stack Size > > +----------------- > > + > > +A task's shadow stack is allocated from memory to a fixed size of > > +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is > > allocated to > > +the maximum size of the normal stack, but capped to 4 GB. However, > > +a compat-mode application's address space is smaller, each of its > > thread's > > +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB). > > This policy tries to handle all threads with the same shadow stack > size logic, which has limitations. 
I think it should be improved > (otherwise some applications will have to turn shstk off): > > - RLIMIT_STACK is not an upper bound for the main thread stack size > (rlimit can increase/decrease dynamically). > - RLIMIT_STACK only applies to the main thread, so it is not an upper > bound for non-main thread stacks. > - i.e. stack size >> startup RLIMIT_STACK is possible and then shadow > stack can overflow. > - stack size << startup RLIMIT_STACK is also possible and then VA > space is wasted (can lead to OOM with strict memory overcommit). > - clone3 tells the kernel the thread stack size so that should be > used instead of RLIMIT_STACK. (clone does not though.) This actually happens already. I can update the docs. > - I think it's better to have a new limit specifically for shadow > stack size (which by default can be RLIMIT_STACK) so userspace > can adjust it if needed (another reason is that stack size is > not always a good indicator of max call depth). Hmm, yea. This seems like a good idea, but I don't see why it can't be a follow on. The series is quite big just to get the basics. I have tried to save some of the enhancements (like alt shadow stack) for the future. > > > +Signal > > +------ > > + > > +By default, the main program and its signal handlers use the same > > shadow > > +stack. Because the shadow stack stores only return addresses, a > > large > > +shadow stack covers the condition that both the program stack and > > the > > +signal alternate stack run out. > > What does "by default" mean here? Is there a case when the signal > handler > is not entered with SSP set to the handling thread'd shadow stack? Ah, yea, that could be updated. It is in reference to an alt shadow stack implementation that was held for later. > > > +When a signal happens, the old pre-signal state is pushed on the > > stack. When > > +shadow stack is enabled, the shadow stack specific state is pushed > > onto the > > +shadow stack. 
Today this is only the old SSP (shadow stack > > pointer), pushed > > +in a special format with bit 63 set. On sigreturn this old SSP > > token is > > +verified and restored by the kernel. The kernel will also push the > > normal > > +restorer address to the shadow stack to help userspace avoid a > > shadow stack > > +violation on the sigreturn path that goes through the restorer. > > The kernel pushes on the shadow stack on signal entry so shadow stack > overflow cannot be handled. Please document this as non-recoverable > failure. It doesn't hurt to call it out. Please see the below link for future plans to handle this scenario (alt shadow stack). > > I think it can be made recoverable if signals with alternate stack > run > on a different shadow stack. And the top of the thread shadow stack > is > just corrupted instead of pushed in the overflow case. Then longjmp > out > can be made to work (common in stack overflow handling cases), and > reliable crash report from the signal handler works (also common). > > Does SSP get stored into the sigcontext struct somewhere? No, it's pushed to the shadow stack only. See the v2 coverletter of the discussion on the design and reasoning: https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/ > > > +Fork > > +---- > > + > > +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are > > required > > +to be read-only and dirty. When a shadow stack PTE is not RO and > > dirty, a > > +shadow access triggers a page fault with the shadow stack access > > bit set > > +in the page fault error code. > > + > > +When a task forks a child, its shadow stack PTEs are copied and > > both the > > +parent's and the child's shadow stack PTEs are cleared of the > > dirty bit. > > +Upon the next shadow stack access, the resulting shadow stack page > > fault > > +is handled by page copy/re-use. > > + > > +When a pthread child is created, the kernel allocates a new shadow > > stack > > +for the new thread. 
New shadow stack's behave like mmap() with > > respect to > > +ASLR behavior. > > Please document the shadow stack lifetimes here: > > I think thread exit unmaps shadow stack and vfork shares shadow stack > with parent so exit does not unmap. Sure, this can be updated. > > I think the map_shadow_stack syscall should be mentioned in this > document too. There is a man page prepared for this. I plan to update the docs to reference it when it exists and not duplicate the text. There can be a blurb for the time being but it would be short lived. > If one wants to scan the shadow stack how to detect the end (e.g. > fast > backtrace)? Is it useful to put an invalid value (-1) there? > (affects map_shadow_stack syscall too). Interesting idea. I think it's probably not a breaking ABI change if we wanted to add it later.
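The sizing rule discussed in this reply can be sketched as a pure function. It assumes, per the quoted documentation, that the size is MIN(RLIMIT_STACK, 4 GB), and, per the statement above that the clone3 stack size is already used, that a caller-supplied stack size replaces RLIMIT_STACK when present. The function name, the "0 means not provided" convention and the omission of the compat-mode 1/4 rule and of page alignment are assumptions of this sketch, not the kernel code.

```c
#include <stdint.h>

#define MODEL_SZ_4G (4ULL * 1024 * 1024 * 1024)

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Model of the shadow stack size policy.
 * rlimit_stack:      current RLIMIT_STACK soft limit in bytes
 *                    (an "unlimited" value simply hits the 4 GB cap).
 * clone3_stack_size: stack size passed to clone3, or 0 if none was given
 *                    (hypothetical convention for this sketch).
 */
uint64_t model_shstk_size(uint64_t rlimit_stack, uint64_t clone3_stack_size)
{
    uint64_t base = clone3_stack_size ? clone3_stack_size : rlimit_stack;

    return min_u64(base, MODEL_SZ_4G);
}
```

With an 8 MiB RLIMIT_STACK and no clone3 size the model yields 8 MiB; with a 64 KiB clone3 stack it yields 64 KiB, which illustrates the point in the review that the regular stack size, not call depth, is what drives the shadow stack size.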
On Wed, 2023-03-01 at 10:07 -0800, Rick Edgecombe wrote:
> > If one wants to scan the shadow stack how to detect the end (e.g.
> > fast backtrace)? Is it useful to put an invalid value (-1) there?
> > (affects map_shadow_stack syscall too).
>
> Interesting idea. I think it's probably not a breaking ABI change if
> we wanted to add it later.

One complication could be how to handle shadow stacks created outside
of thread creation. map_shadow_stack would typically add a token at the
end so it could be pivoted to. So then the backtracing algorithm would
have to know to skip it or something to find a special start of stack
marker.

Alternatively, the thread shadow stacks could get an already used token
pushed at the end, to try to match what an in-use map_shadow_stack
shadow stack would look like. Then the backtracing algorithm could just
look for the same token in both cases. It might get confused in exotic
cases and mistake a token in the middle of the stack for the end of the
allocation though. Hmm...
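To make the end-marker idea concrete, here is a minimal sketch of a scan that stops at a sentinel. It operates on an ordinary array standing in for shadow stack memory. The marker value (all ones, i.e. the "-1" floated above), the function name and the `max_entries` bound are assumptions; no such marker is defined by the series under review.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-stack marker: the "-1" value suggested in the thread. */
#define MODEL_SHSTK_END_MARKER UINT64_MAX

/*
 * Count entries from ssp (most recent entry, lowest address) toward
 * higher addresses until the marker is seen. Returns max_entries if no
 * marker is found within the bound, so the caller can tell the scan
 * was cut short.
 */
size_t model_shstk_depth(const uint64_t *ssp, size_t max_entries)
{
    size_t n = 0;

    while (n < max_entries && ssp[n] != MODEL_SHSTK_END_MARKER)
        n++;
    return n;
}
```

A scan like this treats every non-marker word as a return address. As noted above, a stack created by map_shadow_stack may hold a token at its end instead, so a real walker would need a rule for tokens as well.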
The 03/01/2023 18:07, Edgecombe, Rick P wrote: > On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote: > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > +Application Enabling > > > +==================== > > > + > > > +An application's CET capability is marked in its ELF note and can > > > be verified > > > +from readelf/llvm-readelf output:: > > > + > > > + readelf -n <application> | grep -a SHSTK > > > + properties: x86 feature: SHSTK > > > + > > > +The kernel does not process these applications markers directly. > > > Applications > > > +or loaders must enable CET features using the interface described > > > in section 4. > > > +Typically this would be done in dynamic loader or static runtime > > > objects, as is > > > +the case in GLIBC. > > > > Note that this has to be an early decision in libc (ld.so or static > > exe start code), which will be difficult to hook into system wide > > security policy settings. (e.g. to force shstk on marked binaries.) > > In the eager enabling (by the kernel) scenario, how is this improved? > The loader has to have the option to disable the shadow stack if > enabling conditions are not met, so it still has to trust userspace to > not do that. Did you have any more specifics on how the policy would > work? i guess my issue is that the arch prctls only allow self policing. there is no kernel mechanism to set policy from outside the process that is either inherited or asynchronously set. policy is completely managed by libc (and done very early). now i understand that async disable does not work (thanks for the explanation), but some control for forced enable/locking inherited across exec could work. > > From userspace POV I'd prefer if a static exe did not have to parse > > its own ELF notes (i.e. kernel enabled shstk based on the marking). > > This is actually exactly what happens in the glibc patches. My > understand was that it already been discussed amongst glibc folks. 

there were many glibc patches some of which are committed despite not
having an accepted linux abi, so i'm trying to review the linux abi
contracts and expect this patch to be authoritative, please bear with me.

> > - I think it's better to have a new limit specifically for shadow
> >   stack size (which by default can be RLIMIT_STACK) so userspace
> >   can adjust it if needed (another reason is that stack size is
> >   not always a good indicator of max call depth).
>
> Hmm, yea. This seems like a good idea, but I don't see why it can't be
> a follow on. The series is quite big just to get the basics. I have
> tried to save some of the enhancements (like alt shadow stack) for the
> future.

it is actually not obvious how to introduce a limit so it is inherited
or reset in a sensible way so i think it is useful to discuss it
together with other issues.

> > The kernel pushes on the shadow stack on signal entry so shadow stack
> > overflow cannot be handled. Please document this as non-recoverable
> > failure.
>
> It doesn't hurt to call it out. Please see the below link for future
> plans to handle this scenario (alt shadow stack).
>
> > I think it can be made recoverable if signals with alternate stack
> > run on a different shadow stack. And the top of the thread shadow
> > stack is just corrupted instead of pushed in the overflow case. Then
> > longjmp out can be made to work (common in stack overflow handling
> > cases), and reliable crash report from the signal handler works
> > (also common).
> >
> > Does SSP get stored into the sigcontext struct somewhere?
>
> No, it's pushed to the shadow stack only. See the v2 coverletter of the
> discussion on the design and reasoning:
>
> https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

i think this should be part of the initial design as it may be hard
to change later.

"sigaltshstk() is separate from sigaltstack(). You can have one
without the other, neither or both together.
Because the shadow stack specific state is pushed to the shadow
stack, the two features don’t need to know about each other."

this means they cannot be changed together atomically.

i'd expect most sigaltstack users to want to be resilient
against shadow stack overflow which means non-portable
code changes.

i don't see why automatic alt shadow stack allocation would
not work (kernel manages it transparently when an alt stack
is installed or disabled).

"Since shadow alt stacks are a new feature, longjmp()ing from an
alt shadow stack will simply not be supported. If a libc want’s
to support this it will need to enable WRSS and write it’s own
restore token."

i think longjmp should work without enabling writes to the shadow
stack in the libc. this can also affect unwinding across signal
handlers (not for c++ but e.g. glibc thread cancellation).

i'd prefer overwriting the shadow stack top entry on overflow to
disallowing longjmp out of a shadow stack overflow handler.

> > I think the map_shadow_stack syscall should be mentioned in this
> > document too.
>
> There is a man page prepared for this. I plan to update the docs to
> reference it when it exists and not duplicate the text. There can be a
> blurb for the time being but it would be short lived.

i wanted to comment on the syscall because i think it may be better
to have a magic mmap MAP_ flag that takes care of everything. but i can
go comment on the specific patch then.

thanks.
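For readers following the signal-frame discussion, the token format quoted from the documentation (the old SSP "pushed in a special format with bit 63 set") can be sketched as an encode/verify pair. This is only a model of the stated format: the function names and the -1 error value are assumptions, and whatever additional checks the kernel performs on sigreturn are not represented here.

```c
#include <stdint.h>

/* Bit 63 marks a word on the shadow stack as a saved-SSP signal token. */
#define MODEL_SIGFRAME_SSP_BIT (1ULL << 63)

/* Build the token that would be pushed on signal delivery. */
uint64_t model_sigframe_token_encode(uint64_t old_ssp)
{
    return old_ssp | MODEL_SIGFRAME_SSP_BIT;
}

/*
 * Verify a token on the sigreturn path and recover the old SSP.
 * Returns 0 on success, -1 if the word does not carry bit 63 and so
 * cannot be a token in this format.
 */
int model_sigframe_token_decode(uint64_t token, uint64_t *old_ssp)
{
    if (!(token & MODEL_SIGFRAME_SSP_BIT))
        return -1;
    *old_ssp = token & ~MODEL_SIGFRAME_SSP_BIT;
    return 0;
}
```

Since the saved SSP lives only on the shadow stack in this scheme and not in sigcontext, userspace that wants to unwind across a signal frame would have to read and decode a word like this from the shadow stack itself.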
On Thu, 2023-03-02 at 16:14 +0000, szabolcs.nagy@arm.com wrote: > The 03/01/2023 18:07, Edgecombe, Rick P wrote: > > On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote: > > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > > +Application Enabling > > > > +==================== > > > > + > > > > +An application's CET capability is marked in its ELF note and > > > > can > > > > be verified > > > > +from readelf/llvm-readelf output:: > > > > + > > > > + readelf -n <application> | grep -a SHSTK > > > > + properties: x86 feature: SHSTK > > > > + > > > > +The kernel does not process these applications markers > > > > directly. > > > > Applications > > > > +or loaders must enable CET features using the interface > > > > described > > > > in section 4. > > > > +Typically this would be done in dynamic loader or static > > > > runtime > > > > objects, as is > > > > +the case in GLIBC. > > > > > > Note that this has to be an early decision in libc (ld.so or > > > static > > > exe start code), which will be difficult to hook into system wide > > > security policy settings. (e.g. to force shstk on marked > > > binaries.) > > > > In the eager enabling (by the kernel) scenario, how is this > > improved? > > The loader has to have the option to disable the shadow stack if > > enabling conditions are not met, so it still has to trust userspace > > to > > not do that. Did you have any more specifics on how the policy > > would > > work? > > i guess my issue is that the arch prctls only allow self policing. > there is no kernel mechanism to set policy from outside the process > that is either inherited or asynchronously set. policy is completely > managed by libc (and done very early). > > now i understand that async disable does not work (thanks for the > explanation), but some control for forced enable/locking inherited > across exec could work. Is the idea that shadow stack would be forced on regardless of if the linked libraries support it? 
In which case it could be allowed to crash if they do not? I think the majority of users would prefer the other case where shadow stack is only used if supported, so this sounds like a special case. Rather than lose the flexibility for the typical case, I would think something like this could be an additional enabling mode. glibc could check if shadow stack is already enabled by the kernel using the arch_prctl()s in this case. We are having to work around the existing broken glibc binaries by not triggering off the elf bits automatically in the kernel, but I suppose if this was a special "I don't care if it crashes" feature, maybe it would be ok. Otherwise we would need to change the elf header bit to exclude the old binaries to even be able to do this, and there was extreme resistance to this idea from the userspace side. > > > > From userspace POV I'd prefer if a static exe did not have to > > > parse > > > its own ELF notes (i.e. kernel enabled shstk based on the > > > marking). > > > > This is actually exactly what happens in the glibc patches. My > > understand was that it already been discussed amongst glibc folks. > > there were many glibc patches some of which are committed despite not > having an accepted linux abi, so i'm trying to review the linux abi > contracts and expect this patch to be authorative, please bear with > me. H.J. has some recent ones that work against this kernel series that might interest you. The existing upstream glibc support will not get used due to the enabling interface change to arch_prctl() (this was one of the inspirations of the change actually). > > > > - I think it's better to have a new limit specifically for shadow > > > stack size (which by default can be RLIMIT_STACK) so userspace > > > can adjust it if needed (another reason is that stack size is > > > not always a good indicator of max call depth). > > > > Hmm, yea. This seems like a good idea, but I don't see why it can't > > be > > a follow on. 
The series is quite big just to get the basics. I have > > tried to save some of the enhancements (like alt shadow stack) for > > the > > future. > > it is actually not obvious how to introduce a limit so it is > inherited > or reset in a sensible way so i think it is useful to discuss it > together with other issues. Looking at this again, I'm not sure why a new rlimit is needed. It seems many of those points were just formulations of that the clone3 stack size was not used, but it actually is and just not documented. If you disagree perhaps you could elaborate on what the requirements are and we can see if it seems tricky to do in a follow up. > > > > The kernel pushes on the shadow stack on signal entry so shadow > > > stack > > > overflow cannot be handled. Please document this as non- > > > recoverable > > > failure. > > > > It doesn't hurt to call it out. Please see the below link for > > future > > plans to handle this scenario (alt shadow stack). > > > > > > > > I think it can be made recoverable if signals with alternate > > > stack > > > run > > > on a different shadow stack. And the top of the thread shadow > > > stack > > > is > > > just corrupted instead of pushed in the overflow case. Then > > > longjmp > > > out > > > can be made to work (common in stack overflow handling cases), > > > and > > > reliable crash report from the signal handler works (also > > > common). > > > > > > Does SSP get stored into the sigcontext struct somewhere? > > > > No, it's pushed to the shadow stack only. See the v2 coverletter of > > the > > discussion on the design and reasoning: > > > > https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/ > > i think this should be part of the initial design as it may be hard > to change later. This is actually how it came up. Andy Lutomirski said, paraphrasing, "what if we want alt shadow stacks someday, does the signal frame ABI support it?". 
So I created an ABI that supports it and an initial POC, and said lets hold off on the implementation for the first version and just use the sigframe ABI that will allow it for the future. So the point was to make sure the signal format supported alt shadow stacks to make it easier in the future. > > "sigaltshstk() is separate from sigaltstack(). You can have one > without the other, neither or both together. Because the shadow > stack specific state is pushed to the shadow stack, the two > features don’t need to know about each other." > > this means they cannot be changed together atomically. Not sure why this is needed since they can be used separately. So why tie them together? > > i'd expect most sigaltstack users to want to be resilient > against shadow stack overflow which means non-portable > code changes. Portable between architectures? Or between shadow stack vs non-shadow stack? It does seem like it would not be uncommon for users to want both together, but see below. > > i don't see why automatic alt shadow stack allocation would > not work (kernel manages it transparently when an alt stack > is installed or disabled). Ah, I think I see where maybe I can fill you in. Andy Luto had discounted this idea out of hand originally, but I didn't see it at first. sigaltstack lets you set, retrieve, or disable the shadow stack, right... But this doesn't allocate anything, it just sets where the next signal will be handled. This is different than things like threads where there is a new resources being allocated and it makes coming up with logic to guess when to de-allocate the alt shadow stack difficult. You probably already know... But because of this there can be some modes where the shadow stack is changed while on it. For one example, SS_AUTODISARM will disable the alt shadow stack while switching to it and restore when sigreturning. At which point a new altstack can be set. 
In the non-shadow stack case this is nice because future signals won't clobber the alt stack if you switch away from it (swapcontext(), etc). But it also means you can "change" the alt stack while on it ("change" sort of, the auto disarm results in the kernel forgetting it temporarily). I hear where you are coming from with the desire to have it "just work" with existing code, but I think the resulting ABI around the alt shadow stack allocation lifecycle would be way too complicated even if it could be made to work. Hence making a new interface. But also, the idea was that the x86 signal ABI should support handling alt shadow stacks, which is what we have done with this series. If a different interface for configuring it is better than the one from the POC, I'm not seeing a problem jump out. Is there any specific concern about backwards compatibility here? > > "Since shadow alt stacks are a new feature, longjmp()ing from an > alt shadow stack will simply not be supported. If a libc want’s > to support this it will need to enable WRSS and write it’s own > restore token." > > i think longjmp should work without enabling writes to the shadow > stack in the libc. this can also affect unwinding across signal > handlers (not for c++ but e.g. glibc thread cancellation). glibc today does not support longjmp()ing from a different stack (for example even today after a swapcontext()) when shadow stack is used. If glibc used wrss it could be supported maybe, but otherwise I don't see how the HW can support it. HJ and I were actually just discussing this the other day. Are you looking at this series with respect to the arm shadow stack feature by any chance? I would love if glibc/tools would document what the shadow stack limitations are. If the all the arch's have the same or similar limitations perhaps this could be one developer guide. For the most part though, the limitations I've encountered are in glibc and the kernel is more the building blocks. 
> > i'd prefer overwriting the shadow stack top entry on overflow to > disallowing longjmp out of a shadow stack overflow handler. > > > > I think the map_shadow_stack syscall should be mentioned in this > > > document too. > > > > There is a man page prepared for this. I plan to update the docs to > > reference it when it exists and not duplicate the text. There can > > be a > > blurb for the time being but it would be short lived. > > i wanted to comment on the syscall because i think it may be better > to have a magic mmap MAP_ flag that takes care of everything. > > but i can go comment on the specific patch then. > > thanks. A general comment. Not sure if you are aware, but this shadow stack enabling effort is quite old at this point and there have been many discussions on these topics stretching back years. The latest conversation was around getting this series into linux-next soon to get some testing on the MM pieces. I really appreciate getting this ABI feedback as it is always tricky to get right, but at this stage I would hope to be focusing mostly on concrete problems. I also expect to have some amount of ABI growth going forward with all the normal things that entails. Shadow stack is not special in that it can come fully finalized without the need for the real world usage iterative feedback process. At some point we need to move forward with something, and we have quite a bit of initial changes at this point. So I would like to minimize the initial implementation unless anyone sees any likely problems with future growth. Can you be clear if you see any concrete problems at this point or are more looking to evaluate the design reasoning? I'm under the assumption there is nothing that would prohibit linux-next testing while any ABI shakedown happens concurrently at least? Thanks, Rick
The 03/02/2023 21:17, Edgecombe, Rick P wrote: > Is the idea that shadow stack would be forced on regardless of if the > linked libraries support it? In which case it could be allowed to crash > if they do not? execute a binary - with shstk enabled and locked (only if marked?). - with shstk disabled and locked. could be managed in userspace, but it is libc dependent then. > > > > - I think it's better to have a new limit specifically for shadow > > > > stack size (which by default can be RLIMIT_STACK) so userspace > > > > can adjust it if needed (another reason is that stack size is > > > > not always a good indicator of max call depth). > > Looking at this again, I'm not sure why a new rlimit is needed. It > seems many of those points were just formulations of that the clone3 > stack size was not used, but it actually is and just not documented. If > you disagree perhaps you could elaborate on what the requirements are > and we can see if it seems tricky to do in a follow up. - tiny thread stack and deep signal stack. (note that this does not really work with glibc because it has implementation internal signals that don't run on alt stack, cannot be masked and don't fit on a tiny thread stack, but with other runtimes this can be a valid use-case, e.g. musl allows tiny thread stacks, < pagesize.) - thread runtimes with clone (glibc uses clone3 but some dont). - huge stacks but small call depth (problem if some va limit is hit or memory overcommit is disabled). > > "sigaltshstk() is separate from sigaltstack(). You can have one > > without the other, neither or both together. Because the shadow > > stack specific state is pushed to the shadow stack, the two > > features don’t need to know about each other." ... > > i don't see why automatic alt shadow stack allocation would > > not work (kernel manages it transparently when an alt stack > > is installed or disabled). > > Ah, I think I see where maybe I can fill you in. 
Andy Luto had > discounted this idea out of hand originally, but I didn't see it at > first. sigaltstack lets you set, retrieve, or disable the shadow stack, > right... But this doesn't allocate anything, it just sets where the > next signal will be handled. This is different than things like threads > where there is a new resources being allocated and it makes coming up > with logic to guess when to de-allocate the alt shadow stack difficult. > You probably already know... > > But because of this there can be some modes where the shadow stack is > changed while on it. For one example, SS_AUTODISARM will disable the > alt shadow stack while switching to it and restore when sigreturning. > At which point a new altstack can be set. In the non-shadow stack case > this is nice because future signals won't clobber the alt stack if you > switch away from it (swapcontext(), etc). But it also means you can > "change" the alt stack while on it ("change" sort of, the auto disarm > results in the kernel forgetting it temporarily). the problem with swapcontext is that it may unmask signals that run on the alt stack, which means the code cannot jump back after another signal clobbered the alt stack. the non-standard SS_AUTODISARM aims to solve this by disabling alt stack settings on signal entry until the handler returns. so this use case is not about supporting swapcontext out, but about jumping back. however that does not work reliably with this patchset: if swapcontext goes to the thread stack (and not to another stack e.g. used by makecontext), then jump back fails. (and if there is a sigaltshstk installed then even jump out fails.) assuming - jump out from alt shadow stack can be made to work. - alt shadow stack management can be automatic. then this can be improved so jump back works reliably. 
> I hear where you are coming from with the desire to have it "just work" > with existing code, but I think the resulting ABI around the alt shadow > stack allocation lifecycle would be way too complicated even if it > could be made to work. Hence making a new interface. But also, the idea > was that the x86 signal ABI should support handling alt shadow stacks, > which is what we have done with this series. If a different interface > for configuring it is better than the one from the POC, I'm not seeing > a problem jump out. Is there any specific concern about backwards > compatibility here? sigaltstack syscall behaviour may be hard to change later and currently - shadow stack overflow cannot be recovered from. - longjmp out of signal handler fails (with sigaltshstk). - SS_AUTODISARM does not work (jump back can fail). > > "Since shadow alt stacks are a new feature, longjmp()ing from an > > alt shadow stack will simply not be supported. If a libc want’s > > to support this it will need to enable WRSS and write it’s own > > restore token." > > > > i think longjmp should work without enabling writes to the shadow > > stack in the libc. this can also affect unwinding across signal > > handlers (not for c++ but e.g. glibc thread cancellation). > > glibc today does not support longjmp()ing from a different stack (for > example even today after a swapcontext()) when shadow stack is used. If > glibc used wrss it could be supported maybe, but otherwise I don't see > how the HW can support it. > > HJ and I were actually just discussing this the other day. Are you > looking at this series with respect to the arm shadow stack feature by > any chance? I would love if glibc/tools would document what the shadow > stack limitations are. If the all the arch's have the same or similar > limitations perhaps this could be one developer guide. For the most > part though, the limitations I've encountered are in glibc and the > kernel is more the building blocks. 
well we hope that shadow stack behaviour and limitations can be similar across targets. longjmp to different stack should work: it can do the same as setcontext/swapcontext: scan for the pivot token. then only longjmp out of alt shadow stack fails. (this is non-conforming longjmp use, but e.g. qemu relies on it.) for longjmp out of alt shadow stack, the target shadow stack needs a pivot token, which implies the kernel needs to push that on signal entry, which can overflow. but i suspect that can be handled the same way as stackoverflow on signal entry is handled. > A general comment. Not sure if you are aware, but this shadow stack > enabling effort is quite old at this point and there have been many > discussions on these topics stretching back years. The latest > conversation was around getting this series into linux-next soon to get > some testing on the MM pieces. I really appreciate getting this ABI > feedback as it is always tricky to get right, but at this stage I would > hope to be focusing mostly on concrete problems. > > I also expect to have some amount of ABI growth going forward with all > the normal things that entails. Shadow stack is not special in that it > can come fully finalized without the need for the real world usage > iterative feedback process. At some point we need to move forward with > something, and we have quite a bit of initial changes at this point. > > So I would like to minimize the initial implementation unless anyone > sees any likely problems with future growth. Can you be clear if you > see any concrete problems at this point or are more looking to evaluate > the design reasoning? I'm under the assumption there is nothing that > would prohibit linux-next testing while any ABI shakedown happens > concurrently at least? understood. the points that i think are worth raising: - shadow stack size logic may need to change later. (it can be too big, or too small in practice.) 
- shadow stack overflow is not recoverable and the
  possible fix for that (sigaltshstk) breaks longjmp
  out of signal handlers.
- jump back after SS_AUTODISARM swapcontext cannot be
  reliable if alt signal uses thread shadow stack.
- the above two concerns may be mitigated by different
  sigaltstack behaviour which may be hard to add later.
- end token for backtrace may be useful, if added
  later it can be hard to check.

thanks.
On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com <szabolcs.nagy@arm.com> wrote: > > The 03/02/2023 21:17, Edgecombe, Rick P wrote: > > Is the idea that shadow stack would be forced on regardless of if the > > linked libraries support it? In which case it could be allowed to crash > > if they do not? > > execute a binary > - with shstk enabled and locked (only if marked?). > - with shstk disabled and locked. > could be managed in userspace, but it is libc dependent then. > > > > > > - I think it's better to have a new limit specifically for shadow > > > > > stack size (which by default can be RLIMIT_STACK) so userspace > > > > > can adjust it if needed (another reason is that stack size is > > > > > not always a good indicator of max call depth). > > > > Looking at this again, I'm not sure why a new rlimit is needed. It > > seems many of those points were just formulations of that the clone3 > > stack size was not used, but it actually is and just not documented. If > > you disagree perhaps you could elaborate on what the requirements are > > and we can see if it seems tricky to do in a follow up. > > - tiny thread stack and deep signal stack. > (note that this does not really work with glibc because it has > implementation internal signals that don't run on alt stack, > cannot be masked and don't fit on a tiny thread stack, but > with other runtimes this can be a valid use-case, e.g. musl > allows tiny thread stacks, < pagesize.) > > - thread runtimes with clone (glibc uses clone3 but some dont). > > - huge stacks but small call depth (problem if some va limit > is hit or memory overcommit is disabled). > > > > "sigaltshstk() is separate from sigaltstack(). You can have one > > > without the other, neither or both together. Because the shadow > > > stack specific state is pushed to the shadow stack, the two > > > features don’t need to know about each other." > ... 
> > > i don't see why automatic alt shadow stack allocation would > > > not work (kernel manages it transparently when an alt stack > > > is installed or disabled). > > > > Ah, I think I see where maybe I can fill you in. Andy Luto had > > discounted this idea out of hand originally, but I didn't see it at > > first. sigaltstack lets you set, retrieve, or disable the shadow stack, > > right... But this doesn't allocate anything, it just sets where the > > next signal will be handled. This is different than things like threads > > where there is a new resources being allocated and it makes coming up > > with logic to guess when to de-allocate the alt shadow stack difficult. > > You probably already know... > > > > But because of this there can be some modes where the shadow stack is > > changed while on it. For one example, SS_AUTODISARM will disable the > > alt shadow stack while switching to it and restore when sigreturning. > > At which point a new altstack can be set. In the non-shadow stack case > > this is nice because future signals won't clobber the alt stack if you > > switch away from it (swapcontext(), etc). But it also means you can > > "change" the alt stack while on it ("change" sort of, the auto disarm > > results in the kernel forgetting it temporarily). > > the problem with swapcontext is that it may unmask signals > that run on the alt stack, which means the code cannot jump > back after another signal clobbered the alt stack. > > the non-standard SS_AUTODISARM aims to solve this by disabling > alt stack settings on signal entry until the handler returns. > > so this use case is not about supporting swapcontext out, but > about jumping back. however that does not work reliably with > this patchset: if swapcontext goes to the thread stack (and > not to another stack e.g. used by makecontext), then jump back > fails. (and if there is a sigaltshstk installed then even jump > out fails.) 
> > assuming > - jump out from alt shadow stack can be made to work. > - alt shadow stack management can be automatic. > then this can be improved so jump back works reliably. > > > I hear where you are coming from with the desire to have it "just work" > > with existing code, but I think the resulting ABI around the alt shadow > > stack allocation lifecycle would be way too complicated even if it > > could be made to work. Hence making a new interface. But also, the idea > > was that the x86 signal ABI should support handling alt shadow stacks, > > which is what we have done with this series. If a different interface > > for configuring it is better than the one from the POC, I'm not seeing > > a problem jump out. Is there any specific concern about backwards > > compatibility here? > > sigaltstack syscall behaviour may be hard to change later > and currently > - shadow stack overflow cannot be recovered from. > - longjmp out of signal handler fails (with sigaltshstk). > - SS_AUTODISARM does not work (jump back can fail). > > > > "Since shadow alt stacks are a new feature, longjmp()ing from an > > > alt shadow stack will simply not be supported. If a libc want’s > > > to support this it will need to enable WRSS and write it’s own > > > restore token." > > > > > > i think longjmp should work without enabling writes to the shadow > > > stack in the libc. this can also affect unwinding across signal > > > handlers (not for c++ but e.g. glibc thread cancellation). > > > > glibc today does not support longjmp()ing from a different stack (for > > example even today after a swapcontext()) when shadow stack is used. If > > glibc used wrss it could be supported maybe, but otherwise I don't see > > how the HW can support it. > > > > HJ and I were actually just discussing this the other day. Are you > > looking at this series with respect to the arm shadow stack feature by > > any chance? I would love if glibc/tools would document what the shadow > > stack limitations are. 
If the all the arch's have the same or similar > > limitations perhaps this could be one developer guide. For the most > > part though, the limitations I've encountered are in glibc and the > > kernel is more the building blocks. > > well we hope that shadow stack behaviour and limitations can > be similar across targets. > > longjmp to different stack should work: it can do the same as > setcontext/swapcontext: scan for the pivot token. then only > longjmp out of alt shadow stack fails. (this is non-conforming > longjmp use, but e.g. qemu relies on it.) Restore token may not be used with longjmp. Unlike setcontext/swapcontext, longjmp is optional. If longjmp isn't called, there will be an extra token on shadow stack and RET will fail. > for longjmp out of alt shadow stack, the target shadow stack > needs a pivot token, which implies the kernel needs to push that > on signal entry, which can overflow. but i suspect that can be > handled the same way as stackoverflow on signal entry is handled. > > > A general comment. Not sure if you are aware, but this shadow stack > > enabling effort is quite old at this point and there have been many > > discussions on these topics stretching back years. The latest > > conversation was around getting this series into linux-next soon to get > > some testing on the MM pieces. I really appreciate getting this ABI > > feedback as it is always tricky to get right, but at this stage I would > > hope to be focusing mostly on concrete problems. > > > > I also expect to have some amount of ABI growth going forward with all > > the normal things that entails. Shadow stack is not special in that it > > can come fully finalized without the need for the real world usage > > iterative feedback process. At some point we need to move forward with > > something, and we have quite a bit of initial changes at this point. > > > > So I would like to minimize the initial implementation unless anyone > > sees any likely problems with future growth. 
Can you be clear if you > > see any concrete problems at this point or are more looking to evaluate > > the design reasoning? I'm under the assumption there is nothing that > > would prohibit linux-next testing while any ABI shakedown happens > > concurrently at least? > > understood. > > the points that i think are worth raising: > > - shadow stack size logic may need to change later. > (it can be too big, or too small in practice.) > - shadow stack overflow is not recoverable and the > possible fix for that (sigaltshstk) breaks longjmp > out of signal handlers. > - jump back after SS_AUTODISARM swapcontext cannot be > reliable if alt signal uses thread shadow stack. > - the above two concerns may be mitigated by different > sigaltstack behaviour which may be hard to add later. > - end token for backtrace may be useful, if added > later it can be hard to check. > > thanks.
The 03/03/2023 08:57, H.J. Lu wrote:
> On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
> <szabolcs.nagy@arm.com> wrote:
> > longjmp to different stack should work: it can do the same as
> > setcontext/swapcontext: scan for the pivot token. then only
> > longjmp out of alt shadow stack fails. (this is non-conforming
> > longjmp use, but e.g. qemu relies on it.)
>
> Restore token may not be used with longjmp. Unlike setcontext/swapcontext,
> longjmp is optional. If longjmp isn't called, there will be an extra
> token on shadow stack and RET will fail.

what do you mean longjmp is optional?

it can scan the target shadow stack and decide if it's the
same as the current one or not and in the latter case there
should be a restore token to switch to. then it can INCSSP
to reach the target SSP state.

qemu does setjmp, then swapcontext, then longjmp back.
swapcontext can change the stack, but leaves a token behind
so longjmp can switch back.
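
The scheme described in this message can be sketched as a toy model. This is not glibc's longjmp and not the hardware behaviour: each shadow stack is modelled as a Python list, and the names TOKEN, NoRestoreToken and longjmp() below are hypothetical, chosen only for illustration.

```python
# Toy model only: each shadow stack is a Python list, top of stack = end of list.
TOKEN = "restore-token"  # hypothetical stand-in for a shadow stack restore token


class NoRestoreToken(Exception):
    """Raised when the target stack has no token to switch to."""


def longjmp(stacks, current, target, depth):
    """Unwind to `depth` entries on stack `target`; return the steps taken."""
    steps = []
    stack = stacks[target]
    if target != current:
        # Scan the target stack, above the saved depth, for a restore token.
        for i in range(len(stack) - 1, depth - 1, -1):
            if stack[i] == TOKEN:
                del stack[i:]  # switching to the stack consumes the token
                steps.append("switch")
                break
        else:
            raise NoRestoreToken(target)
    # Same-stack unwind: drop entries one at a time, as INCSSP would.
    while len(stack) > depth:
        stack.pop()
        steps.append("incssp")
    return steps
```

In the qemu pattern, setjmp records a depth on the main stack, swapcontext leaves a token on top of it, and longjmp from the other stack finds that token and then unwinds. When the target is a thread shadow stack that was left by signal delivery, no token exists and the model raises NoRestoreToken, which corresponds to the failing alt shadow stack case. The model does not cover the objection about a leftover token when longjmp is never called.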
On Fri, 2023-03-03 at 16:30 +0000, szabolcs.nagy@arm.com wrote:
> the points that i think are worth raising:
>
> - shadow stack size logic may need to change later.
>   (it can be too big, or too small in practice.)

Looking at making it more efficient in the future seems great. But
since we are not in the position of being able to make shadow stacks
completely seamless (see below)

> - shadow stack overflow is not recoverable and the
>   possible fix for that (sigaltshstk) breaks longjmp
>   out of signal handlers.
> - jump back after SS_AUTODISARM swapcontext cannot be
>   reliable if alt signal uses thread shadow stack.
> - the above two concerns may be mitigated by different
>   sigaltstack behaviour which may be hard to add later.

Are you aware that you can't simply emit a restore token on x86 without
first restoring to another restore token? This is why (I'm assuming)
glibc uses incssp to implement longjmp instead of just jumping back to
the setjmp point with a shadow stack restore. So of course then longjmp
can't jump between shadow stacks.

So there are sort of two categories of restrictions on binaries that
mark the SHSTK elf bit. The first category is that they have to take
special steps when switching stacks or jumping around on the stack.
Once they handle this, they can work with shadow stack. The second
category is that they can't do certain patterns of jumping around on
stacks, regardless of the steps they take. So certain previously
allowed software patterns are now impossible, including ones
implemented in glibc. (And the exact restrictions on the glibc APIs are
not documented and this should be fixed). If applications will violate
either type of these restrictions they should not mark the SHSTK elf
bit.

Now that said, there is an exception to these restrictions on x86,
which is the WRSS instruction, which can write to the shadow stack. The
arch_prctl() interface allows this to be optionally enabled and locked.
The v2 signal analysis I pointed to earlier mentions how this might be
used by glibc to support more of the currently restricted patterns.
Please take a look if you haven't (section "setjmp()/longjmp()"). It
also explains why in the non-WRSS scenarios the kernel can't easily
help improve the situation. WRSS opens up writing to the shadow stack,
and so a glibc-WRSS mode would be making a security/compatibility
tradeoff. I think starting with the more restricted mode was ultimately
good in creating a kernel ABI that can support both. If userspace could
paper over ABI gaps with WRSS, we might not have realized the issues we
did.

> - end token for backtrace may be useful, if added
>   later it can be hard to check.

Yes this seems like a good idea. Thanks for the suggestion. I'm not sure
it can't be added later though. I'll POC it and do some more thinking.
On Fri, Mar 3, 2023 at 9:40 AM szabolcs.nagy@arm.com <szabolcs.nagy@arm.com> wrote:
>
> The 03/03/2023 08:57, H.J. Lu wrote:
> > On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
> > <szabolcs.nagy@arm.com> wrote:
> > > longjmp to different stack should work: it can do the same as
> > > setcontext/swapcontext: scan for the pivot token. then only
> > > longjmp out of alt shadow stack fails. (this is non-conforming
> > > longjmp use, but e.g. qemu relies on it.)
> >
> > Restore token may not be used with longjmp. Unlike setcontext/swapcontext,
> > longjmp is optional. If longjmp isn't called, there will be an extra
> > token on shadow stack and RET will fail.
>
> what do you mean longjmp is optional?

In some cases, longjmp is called to handle an error condition and
longjmp won't be called if there is no error.

> it can scan the target shadow stack and decide if it's the
> same as the current one or not and in the latter case there
> should be a restore token to switch to. then it can INCSSP
> to reach the target SSP state.
>
> qemu does setjmp, then swapcontext, then longjmp back.
> swapcontext can change the stack, but leaves a token behind
> so longjmp can switch back.

This needs changes to support shadow stack. Replacing setjmp with
getcontext and longjmp with setcontext may work for shadow stack.

BTW, there is no testcase in glibc for this usage.
On Thu, 2023-03-02 at 16:34 +0000, szabolcs.nagy@arm.com wrote:
> > Alternatively, the thread shadow stacks could get an already used token
> > pushed at the end, to try to match what an in-use map_shadow_stack
> > shadow stack would look like. Then the backtracing algorithm could just
> > look for the same token in both cases. It might get confused in exotic
> > cases and mistake a token in the middle of the stack for the end of the
> > allocation though. Hmm...
>
> a backtracer would search for an end token on an active shadow
> stack. it should be able to skip other tokens that don't seem
> to be code addresses. the end token needs to be identifiable
> and not break security properties. i think it's enough if the
> backtrace is best effort correct, there can be corner-cases when
> shadow stack is difficult to interpret, but e.g. a profiler can
> still make good use of this feature.

So just taking a look at this and remembering we used to have an
arch_prctl() that returned the thread's shadow stack base and size.
Glibc needed it, but we found a way around and dropped it. If we added
something like that back, then it could be used for backtracing in the
typical thread case and also potentially similar things to what glibc
was doing. This also saves ~8 bytes per shadow stack over an
end-of-stack marker, so it's a tiny bit better on memory use.

For the end-of-stack-marker solution:
In the case of thread shadow stacks, I'm not seeing any issues testing
adding markers at the end. So adding this on top of the existing series
for just thread shadow stacks seems to carry a lower risk of
regressions. Especially if we do it in the near term.

For ucontext/map_shadow_stack, glibc expects a token to be at the size
passed in.
So we would either have to create a larger allocation (to include the
marker) or create a new map_shadow_stack flag to do this (it was
expected that there might be new types of initial shadow stack data that
the kernel might need to create). It is also possible to pass a
non-page aligned size and get zeros at the end of the allocation. In
fact glibc does this today in the common case. So that is also an
option.

I think I slightly prefer the former arch_prctl() based solution for a
few reasons:
 - When you need to find the start or end of the shadow stack you can
   just ask for it instead of searching. It can be faster and simpler.
 - It saves 8 bytes of memory per shadow stack.

If this turns out to be wrong and we want to do the marker solution
much later at some point, the safest option would probably be to create
new flags.

But just discussing this with HJ, can you share more on what the usage
is? Like which backtracing operation specifically needs the marker? How
much does it care about the ucontext case?
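
The two options weighed in this message can be compared in a toy model. Everything here is an assumption made for illustration: shadow stack memory is a Python list indexed from the current SSP towards the end of the allocation, END_MARKER is a made-up value (no marker format has been decided in this thread), and is_code is a caller-supplied predicate standing in for "looks like a code address".

```python
END_MARKER = 0  # hypothetical end-of-stack marker value, for illustration only


def backtrace_with_marker(memory, ssp, is_code):
    """Walk from ssp until the end marker, skipping non-code entries."""
    frames = []
    for entry in memory[ssp:]:
        if entry == END_MARKER:
            break
        if is_code(entry):  # best effort: tokens and other data are skipped
            frames.append(entry)
    return frames


def backtrace_with_bounds(memory, ssp, end, is_code):
    """Same walk, but the end index comes from a bounds query instead."""
    return [entry for entry in memory[ssp:end] if is_code(entry)]
```

Both functions return the same frames when the bounds and the marker agree. The difference is where the end of the stack comes from: a search through memory in the first case, a value the caller already holds in the second.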
* szabolcs:

> syscall overhead in case of frequent stack trace collection can be
> avoided by caching (in tls) when ssp falls within the thread shadow
> stack bounds. otherwise caching does not work as the shadow stack may
> be reused (alt shadow stack or ucontext case).

Do we need to perform the system call at each page boundary only? That
should reduce overhead to the degree that it should not matter.

> unfortunately i don't know if syscall overhead is actually a problem
> (probably not) or if backtrace across signal handlers need to work
> with alt shadow stack (i guess it should work for crash reporting).

Ideally, we would implement the backtrace function (in glibc) as just a
shadow stack copy. But this needs to follow the chain of alternate
stacks, and it may also need some form of markup for signal handler
frames (which need program counter adjustment to reflect that a
*non-signal* frame is conceptually nested within the previous
instruction, and not the function the return address points to).

But I think we can add support for this incrementally.

I assume there is no desire at all on the kernel side that sigaltstack
transparently allocates the shadow stack? Because there is no
deallocation function today for sigaltstack?

Thanks,
Florian
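
The caching idea quoted at the top of this message can be sketched as follows. This is a minimal model under stated assumptions: query_bounds stands in for a hypothetical "get shadow stack bounds" system call that does not exist in this series, and ThreadBoundsCache and make_query are invented names. The per-page variant suggested in the reply is not modelled.

```python
class ThreadBoundsCache:
    """Cache only the thread shadow stack bounds; query for anything else."""

    def __init__(self, query_bounds, thread_ssp):
        self._query = query_bounds
        # One query at thread start fills the cache (kept in TLS in practice).
        self._thread = query_bounds(thread_ssp)

    def lookup(self, ssp):
        base, size = self._thread
        if base <= ssp < base + size:
            return self._thread  # cached, no system call needed
        # Alt shadow stack or ucontext stack: may be reused, so never cached.
        return self._query(ssp)


def make_query(regions):
    """Build a fake bounds query over (base, size) regions that logs its calls."""
    calls = []

    def query(ssp):
        calls.append(ssp)
        for base, size in regions:
            if base <= ssp < base + size:
                return (base, size)
        raise ValueError("ssp is not inside any known shadow stack")

    return query, calls
```

With this split, repeated stack trace collection on the thread shadow stack costs one query in total, while every lookup on another stack pays for a fresh query.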
+Kan for shadow stack perf discussion.

On Mon, 2023-03-06 at 16:20 +0000, szabolcs.nagy@arm.com wrote:
> The 03/03/2023 22:35, Edgecombe, Rick P wrote:
> > I think I slightly prefer the former arch_prctl() based solution for a
> > few reasons:
> >  - When you need to find the start or end of the shadow stack you can
> >    just ask for it instead of searching. It can be faster and simpler.
> >  - It saves 8 bytes of memory per shadow stack.
> >
> > If this turns out to be wrong and we want to do the marker solution
> > much later at some point, the safest option would probably be to create
> > new flags.
>
> i see two problems with a get bounds syscall:
>
> - syscall overhead.
>
> - discontinuous shadow stack (e.g. alt shadow stack ends with a
>   pointer to the interrupted thread shadow stack, so stack trace
>   can continue there, except you don't know the bounds of that).
>
> > But just discussing this with HJ, can you share more on what the usage
> > is? Like which backtracing operation specifically needs the marker? How
> > much does it care about the ucontext case?
>
> it could be an option for perf or ptracers to sample the stack trace.
>
> in-process collection of stack trace for profiling or crash reporting
> (e.g. when stack is corrupted) or cross checking stack integrity may
> use it too.
>
> sometimes parsing /proc/self/smaps may be enough, but the idea was to
> enable light-weight backtrace collection in an async-signal-safe way.
>
> syscall overhead in case of frequent stack trace collection can be
> avoided by caching (in tls) when ssp falls within the thread shadow
> stack bounds. otherwise caching does not work as the shadow stack may
> be reused (alt shadow stack or ucontext case).
>
> unfortunately i don't know if syscall overhead is actually a problem
> (probably not) or if backtrace across signal handlers need to work
> with alt shadow stack (i guess it should work for crash reporting).

There was a POC done of perf integration. I'm not too knowledgeable on
perf, but the patch itself didn't need any new shadow stack bounds ABI.
Since it was implemented in the kernel, it could just refer to the
kernel's internal data for the thread's shadow stack bounds.

I asked about ucontext (similar to alt shadow stacks in regards to lack
of bounds ABI), and apparently perf usually focuses on the thread
stacks. Hopefully Kan can lend some more confidence to that assertion.
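For readers following the caching idea quoted above, here is a minimal sketch of what "cache in TLS only when ssp falls within the thread shadow stack bounds" could look like. Everything named here is an assumption made for illustration: `shstk_query_fn` stands in for the hypothetical get-bounds call being debated (no such ABI exists in this series), and `fake_query` is only a test double with made-up addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end) of one shadow stack. */
struct shstk_bounds {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical stand-in for the "get bounds" call discussed in this thread.
 * No such ABI exists in the series; a real one would be a syscall.
 */
typedef int (*shstk_query_fn)(uint64_t ssp, struct shstk_bounds *out);

/* Per-thread cache holding only the thread's own shadow stack bounds. */
static _Thread_local struct shstk_bounds thread_bounds;
static _Thread_local bool thread_bounds_valid;

/* Record the thread shadow stack bounds once, e.g. at thread start. */
void shstk_cache_set(struct shstk_bounds b)
{
    thread_bounds = b;
    thread_bounds_valid = true;
}

/*
 * Look up the bounds of the stack containing ssp. The cache is used only
 * when ssp lies on the thread shadow stack; any other stack (alt shadow
 * stack, ucontext) may have been reused, so it is queried every time and
 * never cached.
 */
int shstk_bounds_for(uint64_t ssp, shstk_query_fn query,
                     struct shstk_bounds *out)
{
    if (thread_bounds_valid &&
        ssp >= thread_bounds.start && ssp < thread_bounds.end) {
        *out = thread_bounds;
        return 0;
    }
    return query(ssp, out);
}

/* Test double: counts calls and reports a fixed, made-up alt stack range. */
static int fake_query_calls;

int fake_query(uint64_t ssp, struct shstk_bounds *out)
{
    (void)ssp;
    fake_query_calls++;
    out->start = 0x7000;
    out->end = 0x8000;
    return 0;
}
```

With this shape the common case (sampling while running on the thread stack) costs no syscall, while the alt shadow stack and ucontext cases still pay for one query per lookup, which is the trade-off described in the quoted mail.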
On Mon, 2023-03-06 at 17:31 +0100, Florian Weimer wrote:
> Ideally, we would implement the backtrace function (in glibc) as just a
> shadow stack copy. But this needs to follow the chain of alternate
> stacks, and it may also need some form of markup for signal handler
> frames (which need program counter adjustment to reflect that a
> *non-signal* frame is conceptually nested within the previous
> instruction, and not the function the return address points to).

In the alt shadow stack case, the shadow stack sigframe will have a
special shadow stack frame with a pointer to the shadow stack it came
from. This may be a thread stack, or some other stack. This writeup in
the v2 of the series has more details and analysis on the signal piece:
https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

So in that design, you should be able to backtrace out of a chain of alt
stacks.

> But I
> think we can add support for this incrementally.

Yea, I think so too.

> I assume there is no desire at all on the kernel side that sigaltstack
> transparently allocates the shadow stack?

It could have some nice benefit for some apps, so I did look into it.

> Because there is no
> deallocation function today for sigaltstack?

Yea, this is why we can't do it transparently. There was some discussion
up the thread on this.
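To make the "backtrace as a shadow stack copy that follows the chain" idea concrete, here is a rough sketch over simulated shadow stacks held in ordinary arrays. It rests on several assumptions that are not settled anywhere in this thread: that the hop between stacks is marked with the sigframe token format from the shstk.rst patch in this series (old SSP with bit 63 set), that each stack ends in a zero entry (a hypothetical end marker), and that the name `shstk_backtrace` and the hop cap are purely illustrative. It also does not attempt the program counter adjustment for signal frames that Florian mentions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sigframe token marker: bit 63 set, per the shstk.rst text in this series. */
#define SHSTK_SIGFRAME_BIT (1ULL << 63)
/* Arbitrary cap so a corrupted chain of tokens cannot loop forever. */
#define SHSTK_MAX_HOPS 16

/*
 * Copy return addresses starting at ssp into out[], following sigframe
 * tokens to the stack that was interrupted. Stops at a zero entry (the
 * hypothetical end marker assumed by this sketch), after max entries, or
 * after SHSTK_MAX_HOPS stack switches. Returns the number of entries
 * written.
 */
size_t shstk_backtrace(const uint64_t *ssp, uint64_t *out, size_t max)
{
    size_t n = 0;
    int hops = 0;

    while (ssp && n < max && *ssp != 0) {
        uint64_t v = *ssp;

        if (v & SHSTK_SIGFRAME_BIT) {
            if (++hops > SHSTK_MAX_HOPS)
                break;
            /* The token holds the pre-signal SSP: continue from there. */
            ssp = (const uint64_t *)(uintptr_t)(v & ~SHSTK_SIGFRAME_BIT);
            continue;
        }
        out[n++] = v;
        /* The stack grows down, so older frames sit at higher addresses. */
        ssp++;
    }
    return n;
}
```

A real implementation would read live shadow stack memory rather than arrays and would still need some answer to the bounds question discussed earlier, since nothing here guarantees an end marker exists.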
On 2023-03-06 1:05 p.m., Edgecombe, Rick P wrote:
> +Kan for shadow stack perf discussion.
>
> On Mon, 2023-03-06 at 16:20 +0000, szabolcs.nagy@arm.com wrote:
>> The 03/03/2023 22:35, Edgecombe, Rick P wrote:
>>> I think I slightly prefer the former arch_prctl() based solution for a
>>> few reasons:
>>>  - When you need to find the start or end of the shadow stack you can
>>>    just ask for it instead of searching. It can be faster and simpler.
>>>  - It saves 8 bytes of memory per shadow stack.
>>>
>>> If this turns out to be wrong and we want to do the marker solution
>>> much later at some point, the safest option would probably be to create
>>> new flags.
>>
>> i see two problems with a get bounds syscall:
>>
>> - syscall overhead.
>>
>> - discontinuous shadow stack (e.g. alt shadow stack ends with a
>>   pointer to the interrupted thread shadow stack, so stack trace
>>   can continue there, except you don't know the bounds of that).
>>
>>> But just discussing this with HJ, can you share more on what the usage
>>> is? Like which backtracing operation specifically needs the marker? How
>>> much does it care about the ucontext case?
>>
>> it could be an option for perf or ptracers to sample the stack trace.
>>
>> in-process collection of stack trace for profiling or crash reporting
>> (e.g. when stack is corrupted) or cross checking stack integrity may
>> use it too.
>>
>> sometimes parsing /proc/self/smaps may be enough, but the idea was to
>> enable light-weight backtrace collection in an async-signal-safe way.
>>
>> syscall overhead in case of frequent stack trace collection can be
>> avoided by caching (in tls) when ssp falls within the thread shadow
>> stack bounds. otherwise caching does not work as the shadow stack may
>> be reused (alt shadow stack or ucontext case).
>>
>> unfortunately i don't know if syscall overhead is actually a problem
>> (probably not) or if backtrace across signal handlers need to work
>> with alt shadow stack (i guess it should work for crash reporting).
>
> There was a POC done of perf integration. I'm not too knowledgeable on
> perf, but the patch itself didn't need any new shadow stack bounds ABI.
> Since it was implemented in the kernel, it could just refer to the
> kernel's internal data for the thread's shadow stack bounds.
>
> I asked about ucontext (similar to alt shadow stacks in regards to lack
> of bounds ABI), and apparently perf usually focuses on the thread
> stacks. Hopefully Kan can lend some more confidence to that assertion.

The POC perf patch I implemented tries to use the shadow stack to
replace the frame pointer to construct a callchain of a user space
thread. Yes, it's in the kernel, perf_callchain_user(). I don't think
the current X86 perf implementation handles the alt stack either. So the
kernel internal data for the thread's shadow stack bounds should be good
enough for the perf case.

Thanks,
Kan
The 03/06/2023 18:08, Edgecombe, Rick P wrote:
> On Mon, 2023-03-06 at 17:31 +0100, Florian Weimer wrote:
> > I assume there is no desire at all on the kernel side that sigaltstack
> > transparently allocates the shadow stack?
>
> It could have some nice benefit for some apps, so I did look into it.
>
> > Because there is no
> > deallocation function today for sigaltstack?
>
> Yea, this is why we can't do it transparently. There was some
> discussion up the thread on this.

changing/disabling the alt stack is not valid while a handler is
executing on it. if we don't allow jumping out and back to an
alt stack (swapcontext) then there can be only one alt stack
live per thread and change/disable can do the shadow stack free.

if jump back is allowed (linux even makes it race-free with
SS_AUTODISARM) then the life-time of alt stack is extended
beyond change/disable (jump back to an unregistered alt stack).

to support jump back to an alt stack the requirements are

1) user has to manage an alt shadow stack together with the alt
   stack (requires user code change, not just libc).

2) kernel has to push a restore token on the thread shadow stack
   on signal entry (at least in case of alt shadow stack, and
   deal with corner cases around shadow stack overflow).
* szabolcs:

> changing/disabling the alt stack is not valid while a handler is
> executing on it. if we don't allow jumping out and back to an
> alt stack (swapcontext) then there can be only one alt stack
> live per thread and change/disable can do the shadow stack free.
>
> if jump back is allowed (linux even makes it race-free with
> SS_AUTODISARM) then the life-time of alt stack is extended
> beyond change/disable (jump back to an unregistered alt stack).
>
> to support jump back to an alt stack the requirements are
>
> 1) user has to manage an alt shadow stack together with the alt
>    stack (requires user code change, not just libc).
>
> 2) kernel has to push a restore token on the thread shadow stack
>    on signal entry (at least in case of alt shadow stack, and
>    deal with corner cases around shadow stack overflow).

We need to have a story for stackful coroutine switching as well, not
just for sigaltstack. I hope that we can use OpenJDK (Project Loom) and
QEMU as guinea pigs. If we have something that works for both,
hopefully that covers a broad range of scenarios. Userspace
coordination can eventually be handled by glibc; we can deallocate
alternate stacks on thread exit fairly easily (at least compared to the
current stack 8-).

Thanks,
Florian
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
new file mode 100644
index 000000000000..f2e6f323cf68
--- /dev/null
+++ b/Documentation/x86/shstk.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against control
+flow hijacking attacks. The HW feature itself can be set up to protect
+both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). Shadow stack
+is a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have both Shadow
+Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace
+shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK, and it can be disabled
+with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the interface described in section 4.
+Typically this would be done in dynamic loader or static runtime objects, as is
+the case in GLIBC.
+
+Enabling arch_prctl()'s
+=======================
+
+ELF features should be enabled by the loader using the below arch_prctl's. They
+are only supported in 64 bit user applications.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+        -EPERM if any of the passed features are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL arguments (non existing feature, etc)
+
+The feature bits supported are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS  - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read the /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However,
+because a compat-mode application's address space is smaller, each of its
+threads' shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the condition that both the program stack and the
+signal alternate stack run out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32 bit ABI signals are not supported in shadow stack processes. Linux prevents
+32 bit execution while shadow stack is enabled by allocating shadow stacks
+outside of the 32 bit address space. When execution enters 32 bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware
+which will be delivered to the process as a segfault. When transitioning to
+userspace the register state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stacks behave like mmap() with respect to
+ASLR behavior.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel, at which point
+userspace can choose to re-enable or lock them.
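
As a reader's aid for the "Proc Status" section of the patch above, here is a small sketch of how a tool might check which features a task is running with. It only relies on the two line formats shown in the documentation text; the function names and the `FEAT_*` bit values are made up for this example and are not the kernel's ARCH_SHSTK_* constants.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative bits local to this sketch, not the kernel's ARCH_SHSTK_* values. */
#define FEAT_SHSTK 0x1
#define FEAT_WRSS  0x2

/*
 * Parse one line of /proc/$PID/status. Returns a mask of FEAT_* bits if the
 * line is the "x86_Thread_features:" line, or -1 for any other line
 * (including "x86_Thread_features_locked:").
 */
int parse_thread_features(const char *line)
{
    static const char key[] = "x86_Thread_features:";
    const char *p;
    int mask = 0;

    if (strncmp(line, key, sizeof(key) - 1) != 0)
        return -1;

    p = line + sizeof(key) - 1;
    for (;;) {
        size_t len;

        p += strspn(p, " \t\n");      /* skip separators */
        len = strcspn(p, " \t\n");    /* length of the next word */
        if (len == 0)
            break;
        if (len == 5 && strncmp(p, "shstk", 5) == 0)
            mask |= FEAT_SHSTK;
        else if (len == 4 && strncmp(p, "wrss", 4) == 0)
            mask |= FEAT_WRSS;
        p += len;
    }
    return mask;
}

/* Scan a status file; returns the mask, or -1 if the line is absent. */
int read_thread_features(const char *path)
{
    char line[256];
    int mask = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        int m = parse_thread_features(line);

        if (m >= 0) {
            mask = m;
            break;
        }
    }
    fclose(f);
    return mask;
}
```

Calling `read_thread_features("/proc/self/status")` on a kernel without this series should simply return -1, since the line is not present there.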