Message ID | 20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | arm64/gcs: Provide support for GCS in userspace | expand |
Hello, Mark Brown <broonie@kernel.org> writes: > Changes in v8: > - Invalidate signal cap token on stack when consuming. > - Typo and other trivial fixes. > - Don't try to use process_vm_write() on GCS, it intentionally does not > work. > - Fix leak of thread GCSs. > - Rebase onto latest clone3() series. > - Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org Thank you for addressing my comments. I still have a few nets and questions in a few patches, but regardless of them: Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
On Sat, Feb 3, 2024, at 7:25 AM, Mark Brown wrote: > The arm64 Guarded Control Stack (GCS) feature provides support for > hardware protected stacks of return addresses, intended to provide > hardening against return oriented programming (ROP) attacks and to make > it easier to gather call stacks for applications such as profiling. > > When GCS is active a secondary stack called the Guarded Control Stack is > maintained, protected with a memory attribute which means that it can > only be written with specific GCS operations. The current GCS pointer > can not be directly written to by userspace. When a BL is executed the > value stored in LR is also pushed onto the GCS, and when a RET is > executed the top of the GCS is popped and compared to LR with a fault > being raised if the values do not match. GCS operations may only be > performed on GCS pages, a data abort is generated if they are not. > > The combination of hardware enforcement and lack of extra instructions > in the function entry and exit paths should result in something which > has less overhead and is more difficult to attack than a purely software > implementation like clang's shadow stacks. > > This series implements support for use of GCS by userspace, along with > support for use of GCS within KVM guests. It does not enable use of GCS > by either EL1 or EL2, this will be implemented separately. Executables > are started without GCS and must use a prctl() to enable it, it is > expected that this will be done very early in application execution by > the dynamic linker or other startup code. For dynamic linking this will > be done by checking that everything in the executable is marked as GCS > compatible. > > x86 has an equivalent feature called shadow stacks, this series depends > on the x86 patches for generic memory management support for the new > guarded/shadow stack page type and shares APIs as much as possible. As > there has been extensive discussion with the wider community around the > ABI for shadow stacks I have as far as practical kept implementation > decisions close to those for x86, anticipating that review would lead to > similar conclusions in the absence of strong reasoning for divergence. > > The main divergence I am concious of is that x86 allows shadow stack to > be enabled and disabled repeatedly, freeing the shadow stack for the > thread whenever disabled, while this implementation keeps the GCS > allocated after disable but refuses to reenable it. This is to avoid > races with things actively walking the GCS during a disable, we do > anticipate that some systems will wish to disable GCS at runtime but are > not aware of any demand for subsequently reenabling it. > > x86 uses an arch_prctl() to manage enable and disable, since only x86 > and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a > patch set for the equivalent RISC-V Zicfiss feature which I initially > adopted fairly directly but following review feedback has been revised > quite a bit. While discussing the ABI implications of shadow stacks in the context of Zicfiss and musl a few days ago, I had the following idea for how to solve the source compatibility problems with shadow stacks in POSIX.1-2004 and POSIX.1-2017: 1. Introduce a "flexible shadow stack handling" option. For what follows, it doesn't matter if this is system-wide, per-mm, or per-vma. 2. Shadow stack faults on non-shadow stack pages, if flexible shadow stack handling is in effect, cause the affected page to become a shadow stack page. When this happens, the page filled with invalid address tokens. Faults from non-shadow-stack accesses to a shadow-stack page which was created by the previous paragraph will cause the page to revert to non-shadow-stack usage, with or without clearing. Important: a shadow stack operation can only load a valid address from a page if that page has been in continuous shadow stack use since the address was written by another shadow stack operation; the flexibility delays error reporting in cases of stray writes but it never allows for corruption of shadow stack operation. 3. Standards-defined operations which use a user-provided stack (makecontext, sigaltstack, pthread_attr_setstack) use a subrange of the provided stack for shadow stack storage. I propose to use a shadow stack size of 1/32 of the provided stack size, rounded up to a positive integer number of pages, and place the shadow stack allocation at the lowest page-aligned address inside the provided stack region. Since page usage is flexible, no change in page permissions is immediately needed; this merely sets the initial shadow stack pointer for the new context. If the shadow stack grew in the opposite direction to the architectural stack, it would not be necessary to pick a fixed direction. 4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide sufficient space for a minimum-sized shadow stack region and worst case alignment. _Without_ doing this, sigaltstack cannot be used to recover from stack overflows if the shadow stack limit is reached first, and makecontext cannot be supported without memory leaks and unreportable error conditions. Kernel-allocated shadow stacks with a unique VM type are still useful since they allows stray writes to crash at the time the stray write is performed, rather than delaying the crash until the next shadow stack read. The pthread and makecontext changes could be purely libc side, but we would need kernel support for sigaltstack and page usage changes. Luckily, there is no need to support stacks which are simultaneously used from more than one process, so "is this a shadow stack page" can be tracked purely at the vma/pte level without any need to involve the inode. POSIX explicitly allows using mmap to obtain stack memory and does not forbid MAP_SHARED; I consider this sufficiently perverse application behavior that it is not necessary to ensure exclusive use of the underlying pages while a shadow stack pte exists. (Applications that use MAP_SHARED for stacks do not get the full benefit of the shadow stack but they keep POSIX.1-2004 conformance, applications that allocate stacks exclusively in MAP_PRIVATE memory lose no security.) The largest complication of this scheme is likely to be that the shadow stack usage property of a page needs to survive the page being swapped out and back in, which likely means that it must be present in the swap PTE. I am substantially less familiar with GCS and SHSTK than with Zicfiss. It is likely that a syscall or other mechanism is needed to initialize the shadow stack in flexible memory for makecontext. Is there interest on the kernel side on having mechanisms to fully support POSIX.1-2004 with GCS or Zicfiss enabled? -s > We currently maintain the x86 pattern of implicitly allocating a shadow > stack for threads started with shadow stack enabled, there has been some > discussion of removing this support and requiring the use of clone3() > with explicit allocation of shadow stacks instead. I have no strong > feelings either way, implicit allocation is not really consistent with > anything else we do and creates the potential for errors around thread > exit but on the other hand it is existing ABI on x86 and minimises the > changes needed in userspace code. > > There is an open issue with support for CRIU, on x86 this required the > ability to set the GCS mode via ptrace. This series supports > configuring mode bits other than enable/disable via ptrace but it needs > to be confirmed if this is sufficient. > > The series depends on support for shadow stacks in clone3(), that series > includes the addition of ARCH_HAS_USER_SHADOW_STACK. > > > https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org > > It also depends on the addition of more waitpid() flags to nolibc: > > > https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@kernel.org > > You can see a branch with the full set of dependencies against Linus' > tree at: > > https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs > > [1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > Signed-off-by: Mark Brown <broonie@kernel.org> > --- > Changes in v8: > - Invalidate signal cap token on stack when consuming. > - Typo and other trivial fixes. > - Don't try to use process_vm_write() on GCS, it intentionally does not > work. > - Fix leak of thread GCSs. > - Rebase onto latest clone3() series. > - Link to v7: > https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org > > Changes in v7: > - Rebase onto v6.7-rc2 via the clone3() patch series. > - Change the token used to cap the stack during signal handling to be > compatible with GCSPOPM. > - Fix flags for new page types. > - Fold in support for clone3(). > - Replace copy_to_user_gcs() with put_user_gcs(). > - Link to v6: > https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org > > Changes in v6: > - Rebase onto v6.6-rc3. > - Add some more gcsb_dsync() barriers following spec clarifications. > - Due to ongoing discussion around clone()/clone3() I've not updated > anything there, the behaviour is the same as on previous versions. > - Link to v5: > https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org > > Changes in v5: > - Don't map any permissions for user GCSs, we always use EL0 accessors > or use a separate mapping of the page. > - Reduce the standard size of the GCS to RLIMIT_STACK/2. > - Enforce a PAGE_SIZE alignment requirement on map_shadow_stack(). > - Clarifications and fixes to documentation. > - More tests. > - Link to v4: > https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org > > Changes in v4: > - Implement flags for map_shadow_stack() allowing the cap and end of > stack marker to be enabled independently or not at all. > - Relax size and alignment requirements for map_shadow_stack(). > - Add more blurb explaining the advantages of hardware enforcement. > - Link to v3: > https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org > > Changes in v3: > - Rebase onto v6.5-rc4. > - Add a GCS barrier on context switch. > - Add a GCS stress test. > - Link to v2: > https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org > > Changes in v2: > - Rebase onto v6.5-rc3. > - Rework prctl() interface to allow each bit to be locked independently. > - map_shadow_stack() now places the cap token based on the size > requested by the caller not the actual space allocated. > - Mode changes other than enable via ptrace are now supported. > - Expand test coverage. > - Various smaller fixes and adjustments. > - Link to v1: > https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org > > --- > Mark Brown (38): > arm64/mm: Restructure arch_validate_flags() for extensibility > prctl: arch-agnostic prctl for shadow stack > mman: Add map_shadow_stack() flags > arm64: Document boot requirements for Guarded Control Stacks > arm64/gcs: Document the ABI for Guarded Control Stacks > arm64/sysreg: Add definitions for architected GCS caps > arm64/gcs: Add manual encodings of GCS instructions > arm64/gcs: Provide put_user_gcs() > arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS) > arm64/mm: Allocate PIE slots for EL0 guarded control stack > mm: Define VM_SHADOW_STACK for arm64 when we support GCS > arm64/mm: Map pages for guarded control stack > KVM: arm64: Manage GCS registers for guests > arm64/gcs: Allow GCS usage at EL0 and EL1 > arm64/idreg: Add overrride for GCS > arm64/hwcap: Add hwcap for GCS > arm64/traps: Handle GCS exceptions > arm64/mm: Handle GCS data aborts > arm64/gcs: Context switch GCS state for EL0 > arm64/gcs: Ensure that new threads have a GCS > arm64/gcs: Implement shadow stack prctl() interface > arm64/mm: Implement map_shadow_stack() > arm64/signal: Set up and restore the GCS context for signal handlers > arm64/signal: Expose GCS state in signal frames > arm64/ptrace: Expose GCS via ptrace and core files > arm64: Add Kconfig for Guarded Control Stack (GCS) > kselftest/arm64: Verify the GCS hwcap > kselftest/arm64: Add GCS as a detected feature in the signal tests > kselftest/arm64: Add framework support for GCS to signal handling tests > kselftest/arm64: Allow signals tests to specify an expected si_code > kselftest/arm64: Always run signals tests with GCS enabled > kselftest/arm64: Add very basic GCS test program > kselftest/arm64: Add a GCS test program built with the system libc > kselftest/arm64: Add test coverage for GCS mode locking > selftests/arm64: Add GCS signal tests > kselftest/arm64: Add a GCS stress test > kselftest/arm64: Enable GCS for the FP stress tests > kselftest: Provide shadow stack enable helpers for arm64 > > Documentation/admin-guide/kernel-parameters.txt | 6 + > Documentation/arch/arm64/booting.rst | 22 + > Documentation/arch/arm64/elf_hwcaps.rst | 3 + > Documentation/arch/arm64/gcs.rst | 233 +++++++ > Documentation/arch/arm64/index.rst | 1 + > Documentation/filesystems/proc.rst | 2 +- > arch/arm64/Kconfig | 20 + > arch/arm64/include/asm/cpufeature.h | 6 + > arch/arm64/include/asm/el2_setup.h | 17 + > arch/arm64/include/asm/esr.h | 28 +- > arch/arm64/include/asm/exception.h | 2 + > arch/arm64/include/asm/gcs.h | 107 +++ > arch/arm64/include/asm/hwcap.h | 1 + > arch/arm64/include/asm/kvm_arm.h | 4 +- > arch/arm64/include/asm/kvm_host.h | 12 + > arch/arm64/include/asm/mman.h | 23 +- > arch/arm64/include/asm/pgtable-prot.h | 14 +- > arch/arm64/include/asm/processor.h | 7 + > arch/arm64/include/asm/sysreg.h | 20 + > arch/arm64/include/asm/uaccess.h | 40 ++ > arch/arm64/include/uapi/asm/hwcap.h | 1 + > arch/arm64/include/uapi/asm/ptrace.h | 8 + > arch/arm64/include/uapi/asm/sigcontext.h | 9 + > arch/arm64/kernel/cpufeature.c | 19 + > arch/arm64/kernel/cpuinfo.c | 1 + > arch/arm64/kernel/entry-common.c | 23 + > arch/arm64/kernel/idreg-override.c | 2 + > arch/arm64/kernel/process.c | 85 +++ > arch/arm64/kernel/ptrace.c | 59 ++ > arch/arm64/kernel/signal.c | 242 ++++++- > arch/arm64/kernel/traps.c | 11 + > arch/arm64/kvm/emulate-nested.c | 4 + > arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 + > arch/arm64/kvm/sys_regs.c | 22 + > arch/arm64/mm/Makefile | 1 + > arch/arm64/mm/fault.c | 79 ++- > arch/arm64/mm/gcs.c | 300 +++++++++ > arch/arm64/mm/mmap.c | 13 +- > arch/arm64/tools/cpucaps | 1 + > arch/x86/include/uapi/asm/mman.h | 3 - > fs/proc/task_mmu.c | 3 + > include/linux/mm.h | 16 +- > include/uapi/asm-generic/mman.h | 4 + > include/uapi/linux/elf.h | 1 + > include/uapi/linux/prctl.h | 22 + > kernel/sys.c | 30 + > tools/testing/selftests/arm64/Makefile | 2 +- > tools/testing/selftests/arm64/abi/hwcap.c | 19 + > tools/testing/selftests/arm64/fp/assembler.h | 15 + > tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 + > tools/testing/selftests/arm64/fp/sve-test.S | 2 + > tools/testing/selftests/arm64/fp/za-test.S | 2 + > tools/testing/selftests/arm64/fp/zt-test.S | 2 + > tools/testing/selftests/arm64/gcs/.gitignore | 5 + > tools/testing/selftests/arm64/gcs/Makefile | 24 + > tools/testing/selftests/arm64/gcs/asm-offsets.h | 0 > tools/testing/selftests/arm64/gcs/basic-gcs.c | 428 ++++++++++++ > tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++ > .../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++ > tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++ > tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++ > tools/testing/selftests/arm64/gcs/libc-gcs.c | 736 +++++++++++++++++++++ > tools/testing/selftests/arm64/signal/.gitignore | 1 + > .../testing/selftests/arm64/signal/test_signals.c | 17 +- > .../testing/selftests/arm64/signal/test_signals.h | 6 + > .../selftests/arm64/signal/test_signals_utils.c | 32 +- > .../selftests/arm64/signal/test_signals_utils.h | 39 ++ > .../arm64/signal/testcases/gcs_exception_fault.c | 62 ++ > .../selftests/arm64/signal/testcases/gcs_frame.c | 88 +++ > .../arm64/signal/testcases/gcs_write_fault.c | 67 ++ > .../selftests/arm64/signal/testcases/testcases.c | 7 + > .../selftests/arm64/signal/testcases/testcases.h | 1 + > tools/testing/selftests/ksft_shstk.h | 37 ++ > 73 files changed, 4241 insertions(+), 40 deletions(-) > --- > base-commit: 50abefbf1bc07f5c4e403fd28f71dcee855100f7 > change-id: 20230303-arm64-gcs-e311ab0d8729 > > Best regards, > -- > Mark Brown <broonie@kernel.org> > > > _______________________________________________ > linux-riscv mailing list > linux-riscv@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv
Hi, I worked on the x86 kernel shadow stack support. I think it is an interesting suggestion. Some questions below, and I will think more on it. On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote: > While discussing the ABI implications of shadow stacks in the context > of > Zicfiss and musl a few days ago, I had the following idea for how to > solve > the source compatibility problems with shadow stacks in POSIX.1-2004 > and > POSIX.1-2017: > > 1. Introduce a "flexible shadow stack handling" option. For what > follows, > it doesn't matter if this is system-wide, per-mm, or per-vma. > > 2. Shadow stack faults on non-shadow stack pages, if flexible shadow > stack > handling is in effect, cause the affected page to become a shadow > stack > page. When this happens, the page filled with invalid address > tokens. Hmm, could the shadow stack underflow onto the real stack then? Not sure how bad that is. INCSSP (incrementing the SSP register on x86) loops are not rare so it seems like something that could happen. > > Faults from non-shadow-stack accesses to a shadow-stack page which > was > created by the previous paragraph will cause the page to revert to > non-shadow-stack usage, with or without clearing. Won't this prevent catching stack overflows when they happen? An overflow will just turn the shadow stack into normal stack and only get detected when the shadow stack unwinds? A related question would be how to handle the expanding nature of the initial stack. I guess the initial stack could be special and have a separate shadow stack. > > Important: a shadow stack operation can only load a valid address > from > a page if that page has been in continuous shadow stack use since > the > address was written by another shadow stack operation; the > flexibility > delays error reporting in cases of stray writes but it never > allows for > corruption of shadow stack operation. Shadow stacks currently have automatic guard gaps to try to prevent one thread from overflowing onto another thread's shadow stack. This would somewhat opens that up, as the stack guard gaps are usually maintained by userspace for new threads. It would have to be thought through if these could still be enforced with checking at additional spots. > > 3. Standards-defined operations which use a user-provided stack > (makecontext, sigaltstack, pthread_attr_setstack) use a subrange > of the > provided stack for shadow stack storage. I propose to use a > shadow > stack size of 1/32 of the provided stack size, rounded up to a > positive > integer number of pages, and place the shadow stack allocation at > the > lowest page-aligned address inside the provided stack region. > > Since page usage is flexible, no change in page permissions is > immediately needed; this merely sets the initial shadow stack > pointer for > the new context. > > If the shadow stack grew in the opposite direction to the > architectural > stack, it would not be necessary to pick a fixed direction. > > 4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide > sufficient > space for a minimum-sized shadow stack region and worst case > alignment. Do all makecontext() callers ensure the size is greater than this? I guess glibc's makecontext() could do this scheme to prevent leaking without any changes to the kernel. Basically steal a little of the stack address range and overwrite it with a shadow stack mapping. But only if the apps leave enough room. If they need to be updated, then they could be updated to manage their own shadow stacks too I think. > > _Without_ doing this, sigaltstack cannot be used to recover from > stack > overflows if the shadow stack limit is reached first, and makecontext > cannot be supported without memory leaks and unreportable error > conditions. FWIW, I think the makecontext() shadow stack leaking is a bad idea. I would prefer the existing makecontext() interface just didn't support shadow stack, rather than the leaking solution glibc does today. The situation (for arm and riscv too I think?) is that some applications will just not work automatically due to custom stack switching implementations. (user level threading libraries, JITs, etc). So I think it should be ok to ask for apps to change to enable shadow stack and we should avoid doing anything too awkward in pursuit of getting it to work completely transparently. For ucontext, there was some discussion about implementing changes to the interface makecontext() interface that allows the app to allocate and manage their own shadow stacks. So they would be responsible for freeing and allocating the shadow stacks. It seems a little more straightforward. For x86, due to some existing GCC binaries that jumped ahead of the kernel support, it will likely require an ABI opt-in to enable alt shadow stacks. So alt shadow stack support design is still pretty open on the x86 side. Very glad to get broader input on it. > > Kernel-allocated shadow stacks with a unique VM type are still useful > since > they allows stray writes to crash at the time the stray write is > performed, > rather than delaying the crash until the next shadow stack read. > > The pthread and makecontext changes could be purely libc side, but we > would > need kernel support for sigaltstack and page usage changes. > > Luckily, there is no need to support stacks which are simultaneously > used > from more than one process, so "is this a shadow stack page" can be > tracked > purely at the vma/pte level without any need to involve the inode. > POSIX > explicitly allows using mmap to obtain stack memory and does not > forbid > MAP_SHARED; I consider this sufficiently perverse application > behavior that > it is not necessary to ensure exclusive use of the underlying pages > while > a shadow stack pte exists. (Applications that use MAP_SHARED for > stacks > do not get the full benefit of the shadow stack but they keep > POSIX.1-2004 > conformance, applications that allocate stacks exclusively in > MAP_PRIVATE > memory lose no security.) On x86 we don't support MAP_SHARED shadow stacks. There is a whole snarl around the dirty bit in the PTE. I'm not sure it's impossible but it was gladly avoided. There is also a benefit in avoiding having them get mapped as writable in a different context. > > The largest complication of this scheme is likely to be that the > shadow > stack usage property of a page needs to survive the page being > swapped out > and back in, which likely means that it must be present in the swap > PTE. > > I am substantially less familiar with GCS and SHSTK than with > Zicfiss. > It is likely that a syscall or other mechanism is needed to > initialize the > shadow stack in flexible memory for makecontext. The ucontext stacks (and alt shadow stacks is the plan) need to have a "restore token". So, yea, you would probably need some syscall to "convert" the normal stack memory into shadow stack with a restore token. > > Is there interest on the kernel side on having mechanisms to fully > support > POSIX.1-2004 with GCS or Zicfiss enabled? Can you clarify, is the goal to meet compatibility with the spec or try to make more apps run with shadow stack automatically?
On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote: > Hi, > > I worked on the x86 kernel shadow stack support. I think it is an > interesting suggestion. Some questions below, and I will think more on > it. > > On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote: > > While discussing the ABI implications of shadow stacks in the context > > of > > Zicfiss and musl a few days ago, I had the following idea for how to > > solve > > the source compatibility problems with shadow stacks in POSIX.1-2004 > > and > > POSIX.1-2017: > > > > 1. Introduce a "flexible shadow stack handling" option. For what > > follows, > > it doesn't matter if this is system-wide, per-mm, or per-vma. > > > > 2. Shadow stack faults on non-shadow stack pages, if flexible shadow > > stack > > handling is in effect, cause the affected page to become a shadow > > stack > > page. When this happens, the page filled with invalid address > > tokens. > > Hmm, could the shadow stack underflow onto the real stack then? Not > sure how bad that is. INCSSP (incrementing the SSP register on x86) > loops are not rare so it seems like something that could happen. Shadow stack underflow should fault on attempt to access non-shadow-stack memory as shadow-stack, no? > > Faults from non-shadow-stack accesses to a shadow-stack page which > > was > > created by the previous paragraph will cause the page to revert to > > non-shadow-stack usage, with or without clearing. > > Won't this prevent catching stack overflows when they happen? An > overflow will just turn the shadow stack into normal stack and only get > detected when the shadow stack unwinds? I don't think that's as big a problem as it sounds like. It might make pinpointing the spot at which things went wrong take a little bit more work, but it should not admit any wrong-execution. > A related question would be how to handle the expanding nature of the > initial stack. I guess the initial stack could be special and have a > separate shadow stack. That seems fine. > > Important: a shadow stack operation can only load a valid address > > from > > a page if that page has been in continuous shadow stack use since > > the > > address was written by another shadow stack operation; the > > flexibility > > delays error reporting in cases of stray writes but it never > > allows for > > corruption of shadow stack operation. > > Shadow stacks currently have automatic guard gaps to try to prevent one > thread from overflowing onto another thread's shadow stack. This would > somewhat opens that up, as the stack guard gaps are usually maintained > by userspace for new threads. It would have to be thought through if > these could still be enforced with checking at additional spots. I would think the existing guard pages would already do that if a thread's shadow stack is contiguous with its own data stack. > > 3. Standards-defined operations which use a user-provided stack > > (makecontext, sigaltstack, pthread_attr_setstack) use a subrange > > of the > > provided stack for shadow stack storage. I propose to use a > > shadow > > stack size of 1/32 of the provided stack size, rounded up to a > > positive > > integer number of pages, and place the shadow stack allocation at > > the > > lowest page-aligned address inside the provided stack region. > > > > Since page usage is flexible, no change in page permissions is > > immediately needed; this merely sets the initial shadow stack > > pointer for > > the new context. > > > > If the shadow stack grew in the opposite direction to the > > architectural > > stack, it would not be necessary to pick a fixed direction. > > > > 4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide > > sufficient > > space for a minimum-sized shadow stack region and worst case > > alignment. > > Do all makecontext() callers ensure the size is greater than this? > > I guess glibc's makecontext() could do this scheme to prevent leaking > without any changes to the kernel. Basically steal a little of the > stack address range and overwrite it with a shadow stack mapping. But > only if the apps leave enough room. If they need to be updated, then > they could be updated to manage their own shadow stacks too I think. From the musl side, I have always looked at the entirely of shadow stack stuff with very heavy skepticism, and anything that breaks existing interface contracts, introduced places where apps can get auto-killed because a late resource allocation fails, or requires applications to code around the existence of something that should be an implementation detail, is a non-starter. To even consider shadow stack support, it must truely be fully non-breaking. > > _Without_ doing this, sigaltstack cannot be used to recover from > > stack > > overflows if the shadow stack limit is reached first, and makecontext > > cannot be supported without memory leaks and unreportable error > > conditions. > > FWIW, I think the makecontext() shadow stack leaking is a bad idea. I > would prefer the existing makecontext() interface just didn't support > shadow stack, rather than the leaking solution glibc does today. AIUI the proposal by Stefan makes it non-leaking because it's just using normal memory that reverts to normal usage on any non-shadow-stack access. Rich
On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote: > On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote: > > 2. Shadow stack faults on non-shadow stack pages, if flexible shadow > > stack > > handling is in effect, cause the affected page to become a shadow > > stack > > page. When this happens, the page filled with invalid address > > tokens. > Hmm, could the shadow stack underflow onto the real stack then? Not > sure how bad that is. INCSSP (incrementing the SSP register on x86) > loops are not rare so it seems like something that could happen. Yes, they'd trash any pages of normal stack they touch as they do so but otherwise seems similar to overflow. > The situation (for arm and riscv too I think?) is that some > applications will just not work automatically due to custom stack > switching implementations. (user level threading libraries, JITs, etc). > So I think it should be ok to ask for apps to change to enable shadow > stack and we should avoid doing anything too awkward in pursuit of > getting it to work completely transparently. Yes, on arm64 anything that rewrites or is otherwise clever with the stack is going to have to understand that the GCS exists on arm64 and do matching rewrites/updates for the GCS. This includes anything that switches stacks, it will need to use GCS specific instructions to change the current shadow stack pointer. > > MAP_SHARED; I consider this sufficiently perverse application > > behavior that > > it is not necessary to ensure exclusive use of the underlying pages > > while > > a shadow stack pte exists. (Applications that use MAP_SHARED for > > stacks > > do not get the full benefit of the shadow stack but they keep > > POSIX.1-2004 > > conformance, applications that allocate stacks exclusively in > > MAP_PRIVATE > > memory lose no security.) > On x86 we don't support MAP_SHARED shadow stacks. There is a whole > snarl around the dirty bit in the PTE. I'm not sure it's impossible but > it was gladly avoided. There is also a benefit in avoiding having them > get mapped as writable in a different context. Similarly for arm64, I think we can physically do it IIRC but between having to map via map_shadow_stack() for security reasons and it just generally not seeming like a clever idea the implementation shouldn't actually let you get a MAP_SHARED GCS it's not something that's been considered. > > I am substantially less familiar with GCS and SHSTK than with > > Zicfiss. > > It is likely that a syscall or other mechanism is needed to > > initialize the > > shadow stack in flexible memory for makecontext. > The ucontext stacks (and alt shadow stacks is the plan) need to have a > "restore token". So, yea, you would probably need some syscall to > "convert" the normal stack memory into shadow stack with a restore > token. Similar considerations for GCS, we need tokens and we don't want userspace to be able to write by itself in the normal case.
On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote: > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote: > > Hmm, could the shadow stack underflow onto the real stack then? Not > > sure how bad that is. INCSSP (incrementing the SSP register on x86) > > loops are not rare so it seems like something that could happen. > > Shadow stack underflow should fault on attempt to access > non-shadow-stack memory as shadow-stack, no? Maybe I'm misunderstanding. I thought the proposal included allowing shadow stack access to convert normal address ranges to shadow stack, and normal writes to convert shadow stack to normal. > > > > Won't this prevent catching stack overflows when they happen? An > > overflow will just turn the shadow stack into normal stack and only > > get > > detected when the shadow stack unwinds? > > I don't think that's as big a problem as it sounds like. It might > make > pinpointing the spot at which things went wrong take a little bit > more > work, but it should not admit any wrong-execution. Right, it's a point about debugging. I'm just trying to analyze the pros and cons and not calling it a showstopper. > > > > Shadow stacks currently have automatic guard gaps to try to prevent > > one > > thread from overflowing onto another thread's shadow stack. This > > would > > somewhat opens that up, as the stack guard gaps are usually > > maintained > > by userspace for new threads. It would have to be thought through > > if > > these could still be enforced with checking at additional spots. > > I would think the existing guard pages would already do that if a > thread's shadow stack is contiguous with its own data stack. The difference is that the kernel provides the guard gaps, where this would rely on userspace to do it. It is not a showstopper either. I think my biggest question on this is how does it change the capability for two threads to share a shadow stack. It might require some special rules around the syscall that writes restore tokens. So I'm not sure. It probably needs a POC. > > From the musl side, I have always looked at the entirely of shadow > stack stuff with very heavy skepticism, and anything that breaks > existing interface contracts, introduced places where apps can get > auto-killed because a late resource allocation fails, or requires > applications to code around the existence of something that should be > an implementation detail, is a non-starter. To even consider shadow > stack support, it must truely be fully non-breaking. The manual assembly stack switching and JIT code in the apps needs to be updated. I don't think there is a way around it. I agree though that the late allocation failures are not great. Mark is working on clone3 support which should allow moving the shadow stack allocation to happen in userspace with the normal stack. Even for riscv though, doesn't it need to update a new register in stack switching? BTW, x86 shadow stack has a mode where the shadow stack is writable with a special instruction (WRSS). It enables the SSP to be set arbitrarily by writing restore tokens. We discussed this as an option to make the existing longjmp() and signal stuff work more transparently for glibc. > > > > _Without_ doing this, sigaltstack cannot be used to recover from > > > stack > > > overflows if the shadow stack limit is reached first, and > > > makecontext > > > cannot be supported without memory leaks and unreportable error > > > conditions. > > > > FWIW, I think the makecontext() shadow stack leaking is a bad idea. > > I > > would prefer the existing makecontext() interface just didn't > > support > > shadow stack, rather than the leaking solution glibc does today. > > AIUI the proposal by Stefan makes it non-leaking because it's just > using normal memory that reverts to normal usage on any > non-shadow-stack access. > Right, but does it break any existing apps anyway (because of small ucontext stack sizes)? BTW, when I talk about "not supporting" I don't mean the app should crash. I mean it should instead run normally, just without shadow stack enabled. Not sure if that was clear. Since shadow stack is not essential for an application to function, it is only security hardening on top. Although determining if an application supports shadow stack has turned out to be difficult in practice. Handling dlopen() is especially hard.
On Tue, 2024-02-20 at 20:14 +0000, Mark Brown wrote: > > Hmm, could the shadow stack underflow onto the real stack then? Not > > sure how bad that is. INCSSP (incrementing the SSP register on x86) > > loops are not rare so it seems like something that could happen. > > Yes, they'd trash any pages of normal stack they touch as they do so > but > otherwise seems similar to overflow. I was thinking in the normal buffer overflow case there is a guard gap at the end of the stack, but in this case the shadow stack is directly adjacent to the regular stack. It's probably a minor point.
On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote: > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote: > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote: > > > Hmm, could the shadow stack underflow onto the real stack then? Not > > > sure how bad that is. INCSSP (incrementing the SSP register on x86) > > > loops are not rare so it seems like something that could happen. > > > > Shadow stack underflow should fault on attempt to access > > non-shadow-stack memory as shadow-stack, no? > > Maybe I'm misunderstanding. I thought the proposal included allowing > shadow stack access to convert normal address ranges to shadow stack, > and normal writes to convert shadow stack to normal. As I understood the original discussion of the proposal on IRC, it was only one-way (from shadow to normal). Unless I'm missing something, making it one-way is necessary to catch situations where the shadow stack would become compromised. > > > Shadow stacks currently have automatic guard gaps to try to prevent > > > one > > > thread from overflowing onto another thread's shadow stack. This > > > would > > > somewhat opens that up, as the stack guard gaps are usually > > > maintained > > > by userspace for new threads. It would have to be thought through > > > if > > > these could still be enforced with checking at additional spots. > > > > I would think the existing guard pages would already do that if a > > thread's shadow stack is contiguous with its own data stack. > > The difference is that the kernel provides the guard gaps, where this > would rely on userspace to do it. It is not a showstopper either. > > I think my biggest question on this is how does it change the > capability for two threads to share a shadow stack. It might require > some special rules around the syscall that writes restore tokens. So > I'm not sure. It probably needs a POC. Why would they be sharing a shadow stack? > > From the musl side, I have always looked at the entirely of shadow > > stack stuff with very heavy skepticism, and anything that breaks > > existing interface contracts, introduced places where apps can get > > auto-killed because a late resource allocation fails, or requires > > applications to code around the existence of something that should be > > an implementation detail, is a non-starter. To even consider shadow > > stack support, it must truely be fully non-breaking. > > The manual assembly stack switching and JIT code in the apps needs to > be updated. I don't think there is a way around it. Indeed, I'm not talking about programs with JIT/manual stack-switching asm, just anything using existing APIs for control of stack -- pthread_setstack, makecontext, sigaltstack, etc. > I agree though that the late allocation failures are not great. Mark is > working on clone3 support which should allow moving the shadow stack > allocation to happen in userspace with the normal stack. Even for riscv > though, doesn't it need to update a new register in stack switching? If clone is called with signals masked, it's probably not necessary for the kernel to set the shadow stack register as part of clone3. > BTW, x86 shadow stack has a mode where the shadow stack is writable > with a special instruction (WRSS). It enables the SSP to be set > arbitrarily by writing restore tokens. We discussed this as an option > to make the existing longjmp() and signal stuff work more transparently > for glibc. > > > > > > > _Without_ doing this, sigaltstack cannot be used to recover from > > > > stack > > > > overflows if the shadow stack limit is reached first, and > > > > makecontext > > > > cannot be supported without memory leaks and unreportable error > > > > conditions. > > > > > > FWIW, I think the makecontext() shadow stack leaking is a bad idea. > > > I > > > would prefer the existing makecontext() interface just didn't > > > support > > > shadow stack, rather than the leaking solution glibc does today. > > > > AIUI the proposal by Stefan makes it non-leaking because it's just > > using normal memory that reverts to normal usage on any > > non-shadow-stack access. > > > > Right, but does it break any existing apps anyway (because of small > ucontext stack sizes)? > > BTW, when I talk about "not supporting" I don't mean the app should > crash. I mean it should instead run normally, just without shadow stack > enabled. Not sure if that was clear. Since shadow stack is not > essential for an application to function, it is only security hardening > on top. > > Although determining if an application supports shadow stack has turned > out to be difficult in practice. Handling dlopen() is especially hard. One reasonable thing to do, that might be preferable to overengineered solutions, is to disable shadow-stack process-wide if an interface incompatible with it is used (sigaltstack, pthread_create with an attribute setup using pthread_attr_setstack, makecontext, etc.), as well as if an incompatible library is is dlopened. This is much more acceptable than continuing to run with shadow stacks managed sloppily by the kernel and async killing the process on OOM, and is probably *more compatible* with apps than changing the minimum stack size requirements out from under them. The place where it's really needed to be able to allocate the shadow stack synchronously under userspace control, in order to harden normal applications that aren't doing funny things, is in pthread_create without a caller-provided stack. Rich
On Tue, Feb 20, 2024, at 6:30 PM, Edgecombe, Rick P wrote: > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote: >> On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote: >> > Hmm, could the shadow stack underflow onto the real stack then? Not >> > sure how bad that is. INCSSP (incrementing the SSP register on x86) >> > loops are not rare so it seems like something that could happen. >> >> Shadow stack underflow should fault on attempt to access >> non-shadow-stack memory as shadow-stack, no? > > Maybe I'm misunderstanding. I thought the proposal included allowing > shadow stack access to convert normal address ranges to shadow stack, > and normal writes to convert shadow stack to normal. Ideally for riscv only writes would cause conversion, an incssp underflow which performs shadow stack reads would be able to fault early. For arm, since a syscall is needed anyway to set up the token in a new shadow stack region, it would make sense for conversion from non-shadow to shadow usage to never be automatic. >> > >> > Won't this prevent catching stack overflows when they happen? An >> > overflow will just turn the shadow stack into normal stack and only >> > get >> > detected when the shadow stack unwinds? >> >> I don't think that's as big a problem as it sounds like. It might >> make >> pinpointing the spot at which things went wrong take a little bit >> more >> work, but it should not admit any wrong-execution. > > Right, it's a point about debugging. I'm just trying to analyze the > pros and cons and not calling it a showstopper. It's certainly undesirable, so I'd like to have both mechanisms available (shadow stacks in ordinary memory to support several problematic APIs, and in dedicated mappings with guard pages otherwise). >> > >> > Shadow stacks currently have automatic guard gaps to try to prevent >> > one >> > thread from overflowing onto another thread's shadow stack. This >> > would >> > somewhat opens that up, as the stack guard gaps are usually >> > maintained >> > by userspace for new threads. It would have to be thought through >> > if >> > these could still be enforced with checking at additional spots. >> >> I would think the existing guard pages would already do that if a >> thread's shadow stack is contiguous with its own data stack. > > The difference is that the kernel provides the guard gaps, where this > would rely on userspace to do it. It is not a showstopper either. > > I think my biggest question on this is how does it change the > capability for two threads to share a shadow stack. It might require > some special rules around the syscall that writes restore tokens. So > I'm not sure. It probably needs a POC. I'm not quite understanding what the property you're looking for here is. >> From the musl side, I have always looked at the entirely of shadow >> stack stuff with very heavy skepticism, and anything that breaks >> existing interface contracts, introduced places where apps can get >> auto-killed because a late resource allocation fails, or requires >> applications to code around the existence of something that should be >> an implementation detail, is a non-starter. To even consider shadow >> stack support, it must truely be fully non-breaking. > > The manual assembly stack switching and JIT code in the apps needs to > be updated. I don't think there is a way around it. Naturally. If an application uses nonportable functionality like JIT and inline assembly, it's fine (within reason) for those nonportable components to need changes for shadow stack support. The objective of this proposal is to allow applications that do _not_ use inline assembly but rather only C APIs defined in POSIX.1-2004 to execute correctly in an environment where shadow stacks are enabled by default. > I agree though that the late allocation failures are not great. Mark is > working on clone3 support which should allow moving the shadow stack > allocation to happen in userspace with the normal stack. Even for riscv > though, doesn't it need to update a new register in stack switching? > > BTW, x86 shadow stack has a mode where the shadow stack is writable > with a special instruction (WRSS). It enables the SSP to be set > arbitrarily by writing restore tokens. We discussed this as an option > to make the existing longjmp() and signal stuff work more transparently > for glibc. > >> >> > > _Without_ doing this, sigaltstack cannot be used to recover from >> > > stack >> > > overflows if the shadow stack limit is reached first, and >> > > makecontext >> > > cannot be supported without memory leaks and unreportable error >> > > conditions. >> > >> > FWIW, I think the makecontext() shadow stack leaking is a bad idea. >> > I >> > would prefer the existing makecontext() interface just didn't >> > support >> > shadow stack, rather than the leaking solution glibc does today. >> >> AIUI the proposal by Stefan makes it non-leaking because it's just >> using normal memory that reverts to normal usage on any >> non-shadow-stack access. >> > > Right, but does it break any existing apps anyway (because of small > ucontext stack sizes)? Possibly, but that's what SIGSTKSZ/MINSIGSTKSZ is for. This is already variable on several platforms due to variable-length vector extensions. > BTW, when I talk about "not supporting" I don't mean the app should > crash. I mean it should instead run normally, just without shadow stack > enabled. Not sure if that was clear. Since shadow stack is not > essential for an application to function, it is only security hardening > on top. I appreciate that. How far can we go in that direction? If we can automatically disable shadow stacks on any call to makecontext, sigaltstack, or pthread_attr_setstack without causing other threads to crash if they were in the middle of shadow stack maintenance we can probably simplify this proposal, although I need to think more about what's possible. > Although determining if an application supports shadow stack has turned > out to be difficult in practice. Handling dlopen() is especially hard. How so? Is the hard part figuring out if you need to do something, or doing it? -s
On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote: > On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote: > > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote: > > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P > > > wrote: > > > > Hmm, could the shadow stack underflow onto the real stack then? > > > > Not > > > > sure how bad that is. INCSSP (incrementing the SSP register on > > > > x86) > > > > loops are not rare so it seems like something that could > > > > happen. > > > > > > Shadow stack underflow should fault on attempt to access > > > non-shadow-stack memory as shadow-stack, no? > > > > Maybe I'm misunderstanding. I thought the proposal included > > allowing > > shadow stack access to convert normal address ranges to shadow > > stack, > > and normal writes to convert shadow stack to normal. > > As I understood the original discussion of the proposal on IRC, it > was > only one-way (from shadow to normal). Unless I'm missing something, > making it one-way is necessary to catch situations where the shadow > stack would become compromised. The original post here: https://lore.kernel.org/lkml/22a53b78-10d7-4a5a-a01e-b2f3a8c22e94@app.fastmail.com/ ...has: "Shadow stack faults on non-shadow stack pages, if flexible shadow stack handling is in effect, cause the affected page to become a shadow stack page. When this happens, the page filled with invalid address tokens." ...and: "Faults from non-shadow-stack accesses to a shadow-stack page which was created by the previous paragraph will cause the page to revert to non- shadow-stack usage, with or without clearing." I see Stefan has clarified in another response. So I'll go try to figure it out. > > > > > Shadow stacks currently have automatic guard gaps to try to > > > > prevent > > > > one > > > > thread from overflowing onto another thread's shadow stack. > > > > This > > > > would > > > > somewhat opens that up, as the stack guard gaps are usually > > > > maintained > > > > by userspace for new threads. It would have to be thought > > > > through > > > > if > > > > these could still be enforced with checking at additional > > > > spots. > > > > > > I would think the existing guard pages would already do that if a > > > thread's shadow stack is contiguous with its own data stack. > > > > The difference is that the kernel provides the guard gaps, where > > this > > would rely on userspace to do it. It is not a showstopper either. > > > > I think my biggest question on this is how does it change the > > capability for two threads to share a shadow stack. It might > > require > > some special rules around the syscall that writes restore tokens. > > So > > I'm not sure. It probably needs a POC. > > Why would they be sharing a shadow stack? The guard gap was introduced originally based on a suggestion that overflowing a shadow stack onto an adjacent shadow stack could cause corruption that could be used by an attacker to work around the protection. There wasn't any concrete demonstrated attacks or suggestion that all the protection was moot. But when we talk about capabilities for converting memory to shadow stack with simple memory accesses, and syscalls that can write restore token to shadow stacks, it's not immediately clear to me that it wouldn't open up something like that. Like if two restore tokens were written to a shadow stack, or two shadow stacks were adjacent with normal memory between them that later got converted to shadow stack. Those sorts of scenarios, but I won't lean on those specific examples. Sorry for being hand wavy. It's just where I'm at, at this point. > > > > From the musl side, I have always looked at the entirely of > > > shadow > > > stack stuff with very heavy skepticism, and anything that breaks > > > existing interface contracts, introduced places where apps can > > > get > > > auto-killed because a late resource allocation fails, or requires > > > applications to code around the existence of something that > > > should be > > > an implementation detail, is a non-starter. To even consider > > > shadow > > > stack support, it must truely be fully non-breaking. > > > > The manual assembly stack switching and JIT code in the apps needs > > to > > be updated. I don't think there is a way around it. > > Indeed, I'm not talking about programs with JIT/manual stack- > switching > asm, just anything using existing APIs for control of stack -- > pthread_setstack, makecontext, sigaltstack, etc. Then I think WRSS might fit your requirements better than what glibc did. It was considered a reduced security mode that made libc's job much easier and had better compatibility, but the last discussion was to try to do it without WRSS. > > > I agree though that the late allocation failures are not great. > > Mark is > > working on clone3 support which should allow moving the shadow > > stack > > allocation to happen in userspace with the normal stack. Even for > > riscv > > though, doesn't it need to update a new register in stack > > switching? > > If clone is called with signals masked, it's probably not necessary > for the kernel to set the shadow stack register as part of clone3. So you would want a mode of clone3 that basically leaves the shadow stack bits alone? Mark was driving that effort, but it doesn't seem horrible to me on first impression. If it would open up the possibility of musl support. > > > BTW, x86 shadow stack has a mode where the shadow stack is writable > > with a special instruction (WRSS). It enables the SSP to be set > > arbitrarily by writing restore tokens. We discussed this as an > > option > > to make the existing longjmp() and signal stuff work more > > transparently > > for glibc. > > > > > > > > BTW, when I talk about "not supporting" I don't mean the app should > > crash. I mean it should instead run normally, just without shadow > > stack > > enabled. Not sure if that was clear. Since shadow stack is not > > essential for an application to function, it is only security > > hardening > > on top. > > > > Although determining if an application supports shadow stack has > > turned > > out to be difficult in practice. Handling dlopen() is especially > > hard. > > One reasonable thing to do, that might be preferable to > overengineered > solutions, is to disable shadow-stack process-wide if an interface > incompatible with it is used (sigaltstack, pthread_create with an > attribute setup using pthread_attr_setstack, makecontext, etc.), as > well as if an incompatible library is is dlopened. I think it would be an interesting approach to determining compatibility. On x86 there has been cases of binaries getting mismarked as supporting shadow stack. So an automated way of filtering some of those out would be very useful I think. I guess the dynamic linker could determine this based on some list of functions? The dlopen() bit gets complicated though. You need to disable shadow stack for all threads, which presumably the kernel could be coaxed into doing. But those threads might be using shadow stack instructions (INCSSP, RSTORSSP, etc). These are a collection of instructions that allow limited control of the SSP. When shadow stack gets disabled, these suddenly turn into #UD generating instructions. So any other threads executing those instructions when shadow stack got disabled would be in for a nasty surprise. Glibc's permissive mode (that disables shadow stack when dlopen()ing a DSO that doesn't support shadow stack) is quite limited because of this. There was a POC for working around it, but I'll stop there for now, to not spam you with the details. I'm not sure of arm and risc-v details on this specific corner, but for x86. > This is much more > acceptable than continuing to run with shadow stacks managed sloppily > by the kernel and async killing the process on OOM, and is probably > *more compatible* with apps than changing the minimum stack size > requirements out from under them. Yep. > > The place where it's really needed to be able to allocate the shadow > stack synchronously under userspace control, in order to harden > normal > applications that aren't doing funny things, is in pthread_create > without a caller-provided stack. Yea most apps don't do anything too tricky. Mostly shadow stack "just works". But it's no excuse to just crash for the others.
On Tue, Feb 20, 2024 at 06:59:58PM -0500, Stefan O'Rear wrote: > On Tue, Feb 20, 2024, at 6:30 PM, Edgecombe, Rick P wrote: > > Maybe I'm misunderstanding. I thought the proposal included allowing > > shadow stack access to convert normal address ranges to shadow stack, > > and normal writes to convert shadow stack to normal. > Ideally for riscv only writes would cause conversion, an incssp underflow > which performs shadow stack reads would be able to fault early. > For arm, since a syscall is needed anyway to set up the token in a new > shadow stack region, it would make sense for conversion from non-shadow > to shadow usage to never be automatic. Well, we only need the token to pivot in userspace so we could *potentially* work something out as part of the conversion process. It's not filling me with enthusiasm though, and I've certainly not actually thought it through yet. > > I agree though that the late allocation failures are not great. Mark is > > working on clone3 support which should allow moving the shadow stack > > allocation to happen in userspace with the normal stack. Even for riscv > > though, doesn't it need to update a new register in stack switching? > > BTW, x86 shadow stack has a mode where the shadow stack is writable > > with a special instruction (WRSS). It enables the SSP to be set > > arbitrarily by writing restore tokens. We discussed this as an option > > to make the existing longjmp() and signal stuff work more transparently > > for glibc. We have this feature on arm64 too, plus a separately controllable push instruction (though that's less useful here). > > BTW, when I talk about "not supporting" I don't mean the app should > > crash. I mean it should instead run normally, just without shadow stack > > enabled. Not sure if that was clear. Since shadow stack is not > > essential for an application to function, it is only security hardening > > on top. > I appreciate that. How far can we go in that direction? If we can > automatically disable shadow stacks on any call to makecontext, sigaltstack, > or pthread_attr_setstack without causing other threads to crash if they were > in the middle of shadow stack maintenance we can probably simplify this > proposal, although I need to think more about what's possible. Aside from concerns about disabling over all the threads in the process (which should be solvable if annoying) this would be incompatible with policies which prevent disabling of shadow stacks, and it feels like it might end up being a gadget people could use which will concern some people. There's a tension here between compatibility and the security applications of these features.
On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > doing. But those threads might be using shadow stack instructions > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > allow limited control of the SSP. When shadow stack gets disabled, > these suddenly turn into #UD generating instructions. So any other > threads executing those instructions when shadow stack got disabled > would be in for a nasty surprise. > Glibc's permissive mode (that disables shadow stack when dlopen()ing a > DSO that doesn't support shadow stack) is quite limited because of > this. There was a POC for working around it, but I'll stop there for > now, to not spam you with the details. I'm not sure of arm and risc-v > details on this specific corner, but for x86. We have the same issue with disabling GCS causing GCS instructions to become undefined.
On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote: > > On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote: > > > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote: > > > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P > > > > > Shadow stacks currently have automatic guard gaps to try to > > > > > prevent > > > > > one > > > > > thread from overflowing onto another thread's shadow stack. > > > > > This > > > > > would > > > > > somewhat opens that up, as the stack guard gaps are usually > > > > > maintained > > > > > by userspace for new threads. It would have to be thought > > > > > through > > > > > if > > > > > these could still be enforced with checking at additional > > > > > spots. > > > > > > > > I would think the existing guard pages would already do that if a > > > > thread's shadow stack is contiguous with its own data stack. > > > > > > The difference is that the kernel provides the guard gaps, where > > > this > > > would rely on userspace to do it. It is not a showstopper either. > > > > > > I think my biggest question on this is how does it change the > > > capability for two threads to share a shadow stack. It might > > > require > > > some special rules around the syscall that writes restore tokens. > > > So > > > I'm not sure. It probably needs a POC. > > > > Why would they be sharing a shadow stack? > > The guard gap was introduced originally based on a suggestion that > overflowing a shadow stack onto an adjacent shadow stack could cause > corruption that could be used by an attacker to work around the > protection. There wasn't any concrete demonstrated attacks or > suggestion that all the protection was moot. OK, so not sharing, just happening to be adjacent. I was thinking from a standpoint of allocating them as part of the same range as the main stack, just with different protections, where that would never happen; you'd always have intervening non-shadowstack pages. But when they're kernel-allocated, yes, they need their own guard pages. > But when we talk about capabilities for converting memory to shadow > stack with simple memory accesses, and syscalls that can write restore > token to shadow stacks, it's not immediately clear to me that it > wouldn't open up something like that. Like if two restore tokens were > written to a shadow stack, or two shadow stacks were adjacent with > normal memory between them that later got converted to shadow stack. > Those sorts of scenarios, but I won't lean on those specific examples. > Sorry for being hand wavy. It's just where I'm at, at this point. I don't think it's safe to have automatic conversions back and forth, only for normal accesses to convert shadowstack to normal memory (in which case, any subsequent attempt to operate on it as shadow stack indicates a critical bug and should be trapped to terminate the process). > > > > From the musl side, I have always looked at the entirely of > > > > shadow > > > > stack stuff with very heavy skepticism, and anything that breaks > > > > existing interface contracts, introduced places where apps can > > > > get > > > > auto-killed because a late resource allocation fails, or requires > > > > applications to code around the existence of something that > > > > should be > > > > an implementation detail, is a non-starter. To even consider > > > > shadow > > > > stack support, it must truely be fully non-breaking. > > > > > > The manual assembly stack switching and JIT code in the apps needs > > > to > > > be updated. I don't think there is a way around it. > > > > Indeed, I'm not talking about programs with JIT/manual stack- > > switching > > asm, just anything using existing APIs for control of stack -- > > pthread_setstack, makecontext, sigaltstack, etc. > > Then I think WRSS might fit your requirements better than what glibc > did. It was considered a reduced security mode that made libc's job > much easier and had better compatibility, but the last discussion was > to try to do it without WRSS. Where can I read more about this? Some searches I tried didn't turn up much useful information. > > > I agree though that the late allocation failures are not great. > > > Mark is > > > working on clone3 support which should allow moving the shadow > > > stack > > > allocation to happen in userspace with the normal stack. Even for > > > riscv > > > though, doesn't it need to update a new register in stack > > > switching? > > > > If clone is called with signals masked, it's probably not necessary > > for the kernel to set the shadow stack register as part of clone3. > > So you would want a mode of clone3 that basically leaves the shadow > stack bits alone? Mark was driving that effort, but it doesn't seem > horrible to me on first impression. If it would open up the possibility > of musl support. Well I'm not sure. That's what we're trying to figure out. But I don't think modifying it is a hard requirement, since it can be modified from userspace if needed as long as signals are masked. > > One reasonable thing to do, that might be preferable to > > overengineered > > solutions, is to disable shadow-stack process-wide if an interface > > incompatible with it is used (sigaltstack, pthread_create with an > > attribute setup using pthread_attr_setstack, makecontext, etc.), as > > well as if an incompatible library is is dlopened. > > I think it would be an interesting approach to determining > compatibility. On x86 there has been cases of binaries getting > mismarked as supporting shadow stack. So an automated way of filtering > some of those out would be very useful I think. I guess the dynamic > linker could determine this based on some list of functions? I didn't follow this whole mess, but from our side (musl) it does not seem relevant. There are no legacy binaries wrongly marked because we have never supported shadow stacks so far. > The dlopen() bit gets complicated though. You need to disable shadow > stack for all threads, which presumably the kernel could be coaxed into > doing. But those threads might be using shadow stack instructions > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > allow limited control of the SSP. When shadow stack gets disabled, > these suddenly turn into #UD generating instructions. So any other > threads executing those instructions when shadow stack got disabled > would be in for a nasty surprise. This is the kernel's problem if that's happening. It should be trapping these and returning immediately like a NOP if shadow stack has been disabled, not generating SIGILL. > > The place where it's really needed to be able to allocate the shadow > > stack synchronously under userspace control, in order to harden > > normal > > applications that aren't doing funny things, is in pthread_create > > without a caller-provided stack. > > Yea most apps don't do anything too tricky. Mostly shadow stack "just > works". But it's no excuse to just crash for the others. One thing to note here is that, to enable this, we're going to need some way to detect "new enough kernel that shadow stack semantics are all right". If there are kernels that have shadow stack support but with problems that make it unsafe to use (this sounds like the case), we can't turn it on without a way to avoid trying to use it on those. Rich
On Tue, 2024-02-20 at 20:27 -0500, dalias@libc.org wrote: > > Then I think WRSS might fit your requirements better than what > > glibc > > did. It was considered a reduced security mode that made libc's job > > much easier and had better compatibility, but the last discussion > > was > > to try to do it without WRSS. > > Where can I read more about this? Some searches I tried didn't turn > up > much useful information. There never was any proposal written down AFAIK. In the past we have had a couple "shadow stack meetup" calls where folks who are working on shadow stack got together to hash out some things. We discussed it there. But briefly, in the Intel SDM (and other places) there is documentation on the special shadow stack instructions. The two key ones for this are WRSS and RSTORSSP. WRSS is an instruction which can be enabled by the kernel (and there is upstream support for this). The instruction can write through shadow stack memory. RSTORSSP can be used to consume a restore token, which is a special value on the shadow stack. When this operations happens the SSP is moved adjacent to the token that was just consumed. So between the two of them the SSP can be adjusted to specific spots on the shadow stack or another shadow stack. Today when you longjmp() with shadow stack in glibc, INCSSP is used to move the SSP back to the spot on the shadow stack where the setjmp() was called. But this algorithm doesn't always work, for example, longjmp()ing between stacks. To work around this glibc uses a scheme where it searches from the target SSP for a shadow stack token and then consumes it and INCSSPs back to the target SPP. It just barely miraculously worked in most cases. Some specific cases that were still open were longjmp()ing off of a custom userspace threading library stack, which may not have left a token behind when it jumped to a new stack. And also, potentially off of an alt shadow stack in the future, depending on whether it leaves a restore token when handling a signal. (the problem there, is if there is no room to leave it). So that is how x86 glic works, and I think arm was thinking along the same lines. But if you have WRSS (and arm's version), you could just write a restore token or anything else you need to fixup on the shadow stack. Then you could longjmp() in one go without any high wire acts. It's much simpler and more robust and would prevent needing to leave a restore token when handling a signal to an alt shadow stack. Although, nothing was ever prototyped. So "in theory". But that is all about moving the SSP where you need it. It doesn't resolve any of the allocation lifecycle issues. I think for those the solutions are: 1. Not supporting ucontext/sigaltstack and shadow stack 2. Stefan's idea 3. A new interface that takes user allocated shadow stacks for those operations My preference has been a combination of 1 and 3. For threads, I think Mark's clone3 enhancements will help. Anyway, there is an attempt at a summary. I'd also point you to HJ for more glibc context, as I mostly worked on the kernel side.
On Tue, 2024-02-20 at 18:11 -0800, Rick Edgecombe wrote: > Some specific cases that were still open were longjmp()ing off of a > custom userspace threading library stack, which may not have left a > token behind when it jumped to a new stack. And also, potentially off > of an alt shadow stack in the future, depending on whether it leaves > a > restore token when handling a signal. (the problem there, is if there > is no room to leave it). Ah, I remember the other one. If the token on the target shadow stack is at the end of the shadow stack, it may not be able to handle pushing a shadow stack signal frame if a signal hits while is unwinding through the token. As in, where normal longjmp() is direct transition, in this case the longjmp() operation can be temporarily in a place where a signal cannot be handled.
On Tue, 2024-02-20 at 18:59 -0500, Stefan O'Rear wrote: > > Ideally for riscv only writes would cause conversion, an incssp > underflow > which performs shadow stack reads would be able to fault early. Why can't makecontext() just clobber part of the low address side of the passed in stack with a shadow stack mapping? Like say it just munmap()'s part of the passed stack, and map_shadow_stack() in it's place. Then you could still have the shadow stack->normal conversion process triggered by normal writes. IIUC the concern there is to make sure the caller can reuse it as normal memory when it is done with the ucontext/sigaltstack stuff? So the normal->shadow stack part could be explicit. But the more I think about this, the more I think it is a hack, and a proper fix is to use new interfaces. It also would be difficult to sell, if the faulting conversion stuff is in any way complex.
On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote: > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > > allow limited control of the SSP. When shadow stack gets disabled, > > these suddenly turn into #UD generating instructions. So any other > > threads executing those instructions when shadow stack got disabled > > would be in for a nasty surprise. > This is the kernel's problem if that's happening. It should be > trapping these and returning immediately like a NOP if shadow stack > has been disabled, not generating SIGILL. I'm not sure that's going to work out well, all it takes is some code that's looking at the shadow stack and expecting something to happen as a result of the instructions it's executing and we run into trouble. A lot of things won't notice and will just happily carry on but I expect there are going to be things that care. We also end up with an additional state for threads that have had shadow stacks transparently disabled, that's managable but still. > > > The place where it's really needed to be able to allocate the shadow > > > stack synchronously under userspace control, in order to harden > > > normal > > > applications that aren't doing funny things, is in pthread_create > > > without a caller-provided stack. > > Yea most apps don't do anything too tricky. Mostly shadow stack "just > > works". But it's no excuse to just crash for the others. > One thing to note here is that, to enable this, we're going to need > some way to detect "new enough kernel that shadow stack semantics are > all right". If there are kernels that have shadow stack support but > with problems that make it unsafe to use (this sounds like the case), > we can't turn it on without a way to avoid trying to use it on those. If we have this automatic conversion of pages to shadow stack then we should have an API for enabling it, userspace should be able to use the presence of that API to determine if the feature is there.
On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote: > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote: > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > > > allow limited control of the SSP. When shadow stack gets disabled, > > > these suddenly turn into #UD generating instructions. So any other > > > threads executing those instructions when shadow stack got disabled > > > would be in for a nasty surprise. > > > This is the kernel's problem if that's happening. It should be > > trapping these and returning immediately like a NOP if shadow stack > > has been disabled, not generating SIGILL. > > I'm not sure that's going to work out well, all it takes is some code > that's looking at the shadow stack and expecting something to happen as > a result of the instructions it's executing and we run into trouble. A > lot of things won't notice and will just happily carry on but I expect > there are going to be things that care. We also end up with an > additional state for threads that have had shadow stacks transparently > disabled, that's managable but still. I said NOP but there's no reason it strictly needs to be a NOP. It could instead do something reasonable to convey the state of racing with shadow stack being disabled. > > > > > The place where it's really needed to be able to allocate the shadow > > > > stack synchronously under userspace control, in order to harden > > > > normal > > > > applications that aren't doing funny things, is in pthread_create > > > > without a caller-provided stack. > > > > Yea most apps don't do anything too tricky. Mostly shadow stack "just > > > works". But it's no excuse to just crash for the others. > > > One thing to note here is that, to enable this, we're going to need > > some way to detect "new enough kernel that shadow stack semantics are > > all right". If there are kernels that have shadow stack support but > > with problems that make it unsafe to use (this sounds like the case), > > we can't turn it on without a way to avoid trying to use it on those. > > If we have this automatic conversion of pages to shadow stack then we > should have an API for enabling it, userspace should be able to use the > presence of that API to determine if the feature is there. Yes, or if a new prctl is needed to make disabling safe (see above) that could probably be used. Rich
On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote: > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote: > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote: > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > > > > allow limited control of the SSP. When shadow stack gets disabled, > > > > these suddenly turn into #UD generating instructions. So any other > > > > threads executing those instructions when shadow stack got disabled > > > > would be in for a nasty surprise. > > > This is the kernel's problem if that's happening. It should be > > > trapping these and returning immediately like a NOP if shadow stack > > > has been disabled, not generating SIGILL. > > I'm not sure that's going to work out well, all it takes is some code > > that's looking at the shadow stack and expecting something to happen as > > a result of the instructions it's executing and we run into trouble. A > I said NOP but there's no reason it strictly needs to be a NOP. It > could instead do something reasonable to convey the state of racing > with shadow stack being disabled. This feels like it's getting complicated and I fear it may be an uphill struggle to get such code merged, at least for arm64. My instinct is that it's going to be much more robust and generally tractable to let things run to some suitable synchronisation point and then disable there, but if we're going to do that then userspace can hopefully arrange to do the disabling itself through the standard disable interface anyway. Presumably it'll want to notice things being disabled at some point anyway? TBH that's been how all the prior proposals for process wide disable I've seen were done.
On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote: > On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote: > > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote: > > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote: > > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > > > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > > > > > allow limited control of the SSP. When shadow stack gets disabled, > > > > > these suddenly turn into #UD generating instructions. So any other > > > > > threads executing those instructions when shadow stack got disabled > > > > > would be in for a nasty surprise. > > > > > This is the kernel's problem if that's happening. It should be > > > > trapping these and returning immediately like a NOP if shadow stack > > > > has been disabled, not generating SIGILL. > > > > I'm not sure that's going to work out well, all it takes is some code > > > that's looking at the shadow stack and expecting something to happen as > > > a result of the instructions it's executing and we run into trouble. A > > > I said NOP but there's no reason it strictly needs to be a NOP. It > > could instead do something reasonable to convey the state of racing > > with shadow stack being disabled. > > This feels like it's getting complicated and I fear it may be an uphill > struggle to get such code merged, at least for arm64. My instinct is > that it's going to be much more robust and generally tractable to let > things run to some suitable synchronisation point and then disable > there, but if we're going to do that then userspace can hopefully > arrange to do the disabling itself through the standard disable > interface anyway. Presumably it'll want to notice things being disabled > at some point anyway? TBH that's been how all the prior proposals for > process wide disable I've seen were done. If it's possible to disable per-thread rather than per-process, some things are easier. Disabling on account of using alt stacks only needs to be done on the threads using those stacks. However, for dlopen purposes you need a way to disable shadow stack for the whole process. Initially this is only needed for the thread that called dlopen, but it needs to have propagated to any thread that synchronizes with completion of the call to dlopen by the time that synchronization occurs, and since that synchronization can happen in lots of different ways that are purely userspace (thanks to futexes being userspace in the uncontended case), I don't see any way to make it work without extremely invasive, high-cost checks. If folks on the kernel side are not going to be amenable to doing the things that are easy for the kernel to make it work without breaking compatibility with existing interfaces, but that are impossible or near-impossible for userspace to do, this seems like a dead-end. And I suspect an operation to "disable shadow stack, but without making threads still in SS-critical sections crash" is going to be necessary.. Rich
On Wed, 2024-02-21 at 12:57 -0500, dalias@libc.org wrote: > > This feels like it's getting complicated and I fear it may be an > > uphill > > struggle to get such code merged, at least for arm64. My instinct > > is > > that it's going to be much more robust and generally tractable to > > let > > things run to some suitable synchronisation point and then disable > > there, but if we're going to do that then userspace can hopefully > > arrange to do the disabling itself through the standard disable > > interface anyway. Presumably it'll want to notice things being > > disabled > > at some point anyway? TBH that's been how all the prior proposals > > for > > process wide disable I've seen were done. > > If it's possible to disable per-thread rather than per-process, some > things are easier. Disabling on account of using alt stacks only > needs > to be done on the threads using those stacks. However, for dlopen > purposes you need a way to disable shadow stack for the whole > process. > Initially this is only needed for the thread that called dlopen, but > it needs to have propagated to any thread that synchronizes with > completion of the call to dlopen by the time that synchronization > occurs, and since that synchronization can happen in lots of > different > ways that are purely userspace (thanks to futexes being userspace in > the uncontended case), I don't see any way to make it work without > extremely invasive, high-cost checks. For glibc's use, we talked about a couple of options. 1. A mode to start suppressing the #UD's from the shadow stack instructions 2. A mode to start suppressing #CPs (the exception that happens when the shadow stack doesn't match). So the shadow stack instructions continue to operate normally, but if the shadow stack gets mismatched due to lack of support, the ret is emulated. It probably is safer (but still not perfect), but the performance penalty of emulating every RET after things get screwed up would be a significant down side. There also needs to be clean handling of shadow stack #PFs. 3. Per-thread locking that is used around all shadow stack operations that could be sensitive to disabling. This could be maybe exposed to apps in case they want to use shadow stack instructions manually. Then during dlopen() it waits until it can cleanly disable shadow stack for each thread. In each critical sections there are checks for whether shadow stack is still enabled. 3 is the cleanest and safest I think, and it was thought it might not need kernel help, due to a scheme Florian had to direct signals to specific threads. It's my preference at this point. 1 and 2 are POCed here, if you are interested: https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/ > > If folks on the kernel side are not going to be amenable to doing the > things that are easy for the kernel to make it work without breaking > compatibility with existing interfaces, but that are impossible or > near-impossible for userspace to do, this seems like a dead-end. And > I > suspect an operation to "disable shadow stack, but without making > threads still in SS-critical sections crash" is going to be > necessary.. I think we have to work through all the alternative before we can accuse the kernel of not being amenable. Is there something that you would like to see out of this conversation that is not happening?
On Wed, Feb 21, 2024 at 06:12:30PM +0000, Edgecombe, Rick P wrote: > On Wed, 2024-02-21 at 12:57 -0500, dalias@libc.org wrote: > > > This feels like it's getting complicated and I fear it may be an > > > uphill > > > struggle to get such code merged, at least for arm64. My instinct > > > is > > > that it's going to be much more robust and generally tractable to > > > let > > > things run to some suitable synchronisation point and then disable > > > there, but if we're going to do that then userspace can hopefully > > > arrange to do the disabling itself through the standard disable > > > interface anyway. Presumably it'll want to notice things being > > > disabled > > > at some point anyway? TBH that's been how all the prior proposals > > > for > > > process wide disable I've seen were done. > > > > If it's possible to disable per-thread rather than per-process, some > > things are easier. Disabling on account of using alt stacks only > > needs > > to be done on the threads using those stacks. However, for dlopen > > purposes you need a way to disable shadow stack for the whole > > process. > > Initially this is only needed for the thread that called dlopen, but > > it needs to have propagated to any thread that synchronizes with > > completion of the call to dlopen by the time that synchronization > > occurs, and since that synchronization can happen in lots of > > different > > ways that are purely userspace (thanks to futexes being userspace in > > the uncontended case), I don't see any way to make it work without > > extremely invasive, high-cost checks. > > For glibc's use, we talked about a couple of options. > 1. A mode to start suppressing the #UD's from the shadow stack > instructions > 2. A mode to start suppressing #CPs (the exception that happens when > the shadow stack doesn't match). So the shadow stack instructions > continue to operate normally, but if the shadow stack gets mismatched > due to lack of support, the ret is emulated. It probably is safer (but > still not perfect), but the performance penalty of emulating every RET > after things get screwed up would be a significant down side. There > also needs to be clean handling of shadow stack #PFs. > 3. Per-thread locking that is used around all shadow stack operations > that could be sensitive to disabling. This could be maybe exposed to > apps in case they want to use shadow stack instructions manually. Then > during dlopen() it waits until it can cleanly disable shadow stack for > each thread. In each critical sections there are checks for whether > shadow stack is still enabled. > > 3 is the cleanest and safest I think, and it was thought it might not > need kernel help, due to a scheme Florian had to direct signals to > specific threads. It's my preference at this point. The operations where the shadow stack has to be processed need to be executable from async-signal context, so this imposes a requirement to block all signals around the lock. This makes all longjmps a heavy, multi-syscall operation rather than O(1) userspace operation. I do not think this is an acceptable implementation, especially when there are clearly superior alternatives without that cost or invasiveness. > 1 and 2 are POCed here, if you are interested: > https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/ I'm not clear why 2 (suppression of #CP) is desirable at all. If shadow stack is being disabled, it should just be disabled, with minimal fault handling to paper over any racing operations at the moment it's disabled. Leaving it on with extreme slowness to make it not actually do anything does not seem useful. Is there some way folks have in mind to use option 2 to lazily disable shadow stack once the first SS-incompatible code is executed, when execution is then known not to be in the middle of a SS-critical section, instead of doing it right away? I don't see how this could work, since the SS-incompatible code could be running from a signal handler that interrupted an SS-critical section. > > If folks on the kernel side are not going to be amenable to doing the > > things that are easy for the kernel to make it work without breaking > > compatibility with existing interfaces, but that are impossible or > > near-impossible for userspace to do, this seems like a dead-end. And > > I > > suspect an operation to "disable shadow stack, but without making > > threads still in SS-critical sections crash" is going to be > > necessary.. > > I think we have to work through all the alternative before we can > accuse the kernel of not being amenable. Is there something that you > would like to see out of this conversation that is not happening? No, I was just interpreting "uphill battle". I really do not want to engage in an uphill battle for the sake of making it practical to support something that was never my goal to begin with. If I'm misreading this, or if others are willing to put the effort into that "battle", I'd be happy to be mistaken about "not amenable". Rich
On Wed, Feb 21, 2024 at 12:57:19PM -0500, dalias@libc.org wrote: > On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote: > > This feels like it's getting complicated and I fear it may be an uphill > > struggle to get such code merged, at least for arm64. My instinct is > > that it's going to be much more robust and generally tractable to let > > things run to some suitable synchronisation point and then disable > > there, but if we're going to do that then userspace can hopefully > > arrange to do the disabling itself through the standard disable > > interface anyway. Presumably it'll want to notice things being disabled > > at some point anyway? TBH that's been how all the prior proposals for > > process wide disable I've seen were done. > If it's possible to disable per-thread rather than per-process, some > things are easier. Disabling on account of using alt stacks only needs Both x86 and arm64 currently track shadow stack enablement per thread, not per process, so it's not just possible to do per thread it's the only thing we're currently implementing. I think the same is true for RISC-V but I didn't look as closely at that yet. > to be done on the threads using those stacks. However, for dlopen > purposes you need a way to disable shadow stack for the whole process. > Initially this is only needed for the thread that called dlopen, but > it needs to have propagated to any thread that synchronizes with > completion of the call to dlopen by the time that synchronization > occurs, and since that synchronization can happen in lots of different > ways that are purely userspace (thanks to futexes being userspace in > the uncontended case), I don't see any way to make it work without > extremely invasive, high-cost checks. Yeah, it's not particularly nice - any whole process disable is going to have some nasty cases I think. Rick's message about covered AFAIR the discussion, there were also some proposals for more limited userspaces I think. > If folks on the kernel side are not going to be amenable to doing the > things that are easy for the kernel to make it work without breaking > compatibility with existing interfaces, but that are impossible or > near-impossible for userspace to do, this seems like a dead-end. And I > suspect an operation to "disable shadow stack, but without making > threads still in SS-critical sections crash" is going to be > necessary.. Could you be more specific as to the easy things that you're referencing here?
On Wed, 2024-02-21 at 13:30 -0500, dalias@libc.org wrote: > > 3 is the cleanest and safest I think, and it was thought it might > > not > > need kernel help, due to a scheme Florian had to direct signals to > > specific threads. It's my preference at this point. > > The operations where the shadow stack has to be processed need to be > executable from async-signal context, so this imposes a requirement > to > block all signals around the lock. This makes all longjmps a heavy, > multi-syscall operation rather than O(1) userspace operation. I do > not > think this is an acceptable implementation, especially when there are > clearly superior alternatives without that cost or invasiveness. That is a good point. Could the per-thread locks be nestable to address this? We just need to know if a thread *might* be using shadow stacks. So we really just need a per-thread count. > > > 1 and 2 are POCed here, if you are interested: > > https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/ > > I'm not clear why 2 (suppression of #CP) is desirable at all. If > shadow stack is being disabled, it should just be disabled, with > minimal fault handling to paper over any racing operations at the > moment it's disabled. Leaving it on with extreme slowness to make it > not actually do anything does not seem useful. The benefit is that code that is using shadow stack instructions won't crash if it relies on them working. For example RDSSP turns into a NOP if shadow stack is disabled, and the intrinsic is written such that a NULL pointer is returned if shadow stack is disabled. The shadow stack is normally readable, and this happens in glibc sometimes. So if there was code like: long foo = *(long *)_get_ssp(); ...then it could suddenly read a NULL pointer if shadow stack got disabled. (notice, it's not even a "shadow stack access" fault-wise. So it was looked at as somewhat more robust. But neither 1 or 2 are perfect for apps that are manually using shadow stack instructions. > > Is there some way folks have in mind to use option 2 to lazily > disable > shadow stack once the first SS-incompatible code is executed, when > execution is then known not to be in the middle of a SS-critical > section, instead of doing it right away? I don't see how this could > work, since the SS-incompatible code could be running from a signal > handler that interrupted an SS-critical section. The idea was to disable it without critical sections, and it could be more robust, but not perfect. I was preferring option 1 between 1 and 2, which was closer to your original suggestion. But it has problems like the example I gave above. I agree 1 is relatively simpler for the kernel, between 1 and 2. > > > > If folks on the kernel side are not going to be amenable to doing > > > the > > > things that are easy for the kernel to make it work without > > > breaking > > > compatibility with existing interfaces, but that are impossible > > > or > > > near-impossible for userspace to do, this seems like a dead-end. > > > And > > > I > > > suspect an operation to "disable shadow stack, but without making > > > threads still in SS-critical sections crash" is going to be > > > necessary.. > > > > I think we have to work through all the alternative before we can > > accuse the kernel of not being amenable. Is there something that > > you > > would like to see out of this conversation that is not happening? > > No, I was just interpreting "uphill battle". I really do not want to > engage in an uphill battle for the sake of making it practical to > support something that was never my goal to begin with. If I'm > misreading this, or if others are willing to put the effort into that > "battle", I'd be happy to be mistaken about "not amenable". I don't think x86 maintainers have put a foot down on anything around this at least. They would normally have concerns about complexity and maintainability. So if we have something that has lower value (imperfect solution), and high complexity, it starts to look like less promising path.
On Wed, Feb 21, 2024 at 06:53:44PM +0000, Edgecombe, Rick P wrote: > On Wed, 2024-02-21 at 13:30 -0500, dalias@libc.org wrote: > > > 3 is the cleanest and safest I think, and it was thought it might > > > not > > > need kernel help, due to a scheme Florian had to direct signals to > > > specific threads. It's my preference at this point. > > > > The operations where the shadow stack has to be processed need to be > > executable from async-signal context, so this imposes a requirement > > to > > block all signals around the lock. This makes all longjmps a heavy, > > multi-syscall operation rather than O(1) userspace operation. I do > > not > > think this is an acceptable implementation, especially when there are > > clearly superior alternatives without that cost or invasiveness. > > That is a good point. Could the per-thread locks be nestable to address > this? We just need to know if a thread *might* be using shadow stacks. > So we really just need a per-thread count. Due to arbitrarily nestable signal frames, no, this does not suffice. An interrupted operation using the lock could be arbitrarily delayed, even never execute again, making any call to dlopen deadlock. > > > 1 and 2 are POCed here, if you are interested: > > > https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/ > > > > I'm not clear why 2 (suppression of #CP) is desirable at all. If > > shadow stack is being disabled, it should just be disabled, with > > minimal fault handling to paper over any racing operations at the > > moment it's disabled. Leaving it on with extreme slowness to make it > > not actually do anything does not seem useful. > > The benefit is that code that is using shadow stack instructions won't > crash if it relies on them working. For example RDSSP turns into a NOP > if shadow stack is disabled, and the intrinsic is written such that a > NULL pointer is returned if shadow stack is disabled. The shadow stack > is normally readable, and this happens in glibc sometimes. So if there > was code like: > > long foo = *(long *)_get_ssp(); > > ...then it could suddenly read a NULL pointer if shadow stack got > disabled. (notice, it's not even a "shadow stack access" fault-wise. So > it was looked at as somewhat more robust. But neither 1 or 2 are > perfect for apps that are manually using shadow stack instructions. It's fine to turn RDSSP into an actual emulated read of the SSP, or at least an emulated load of zero so that uninitialized data is not left in the target register. If doing the latter, code working with the shadow stack just needs to be prepared for the possibility that it could be async-disabled, and check the return value. I have not looked at all the instructions that become #UD but I suspect they all have reasonable trivial ways to implement a "disabled" version of them that userspace can act upon reasonably. Rich
On Wed, Feb 21, 2024 at 06:32:20PM +0000, Mark Brown wrote: > On Wed, Feb 21, 2024 at 12:57:19PM -0500, dalias@libc.org wrote: > > On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote: > > > > This feels like it's getting complicated and I fear it may be an uphill > > > struggle to get such code merged, at least for arm64. My instinct is > > > that it's going to be much more robust and generally tractable to let > > > things run to some suitable synchronisation point and then disable > > > there, but if we're going to do that then userspace can hopefully > > > arrange to do the disabling itself through the standard disable > > > interface anyway. Presumably it'll want to notice things being disabled > > > at some point anyway? TBH that's been how all the prior proposals for > > > process wide disable I've seen were done. > > > If it's possible to disable per-thread rather than per-process, some > > things are easier. Disabling on account of using alt stacks only needs > > Both x86 and arm64 currently track shadow stack enablement per thread, > not per process, so it's not just possible to do per thread it's the > only thing we're currently implementing. I think the same is true for > RISC-V but I didn't look as closely at that yet. That's nice! It allows still keeping part of the benefit of SS in programs which have some threads running with custom stacks. We do however need a global-disable option for dlopen. In musl this could be done via the same mechanism ("synccall") used for set*id -- it's basically userspace IPI. But just having a native operation would be nicer, and would probably help glibc where I don't think they abstracted their set*id mechanism to do other things like this. > > If folks on the kernel side are not going to be amenable to doing the > > things that are easy for the kernel to make it work without breaking > > compatibility with existing interfaces, but that are impossible or > > near-impossible for userspace to do, this seems like a dead-end. And I > > suspect an operation to "disable shadow stack, but without making > > threads still in SS-critical sections crash" is going to be > > necessary.. > > Could you be more specific as to the easy things that you're referencing > here? Basically the ARCH_SHSTK_SUPPRESS_UD proposal. Rich
On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote: > Due to arbitrarily nestable signal frames, no, this does not suffice. > An interrupted operation using the lock could be arbitrarily delayed, > even never execute again, making any call to dlopen deadlock. Doh! Yep, it is not robust to this. The only thing that could be done would be a timeout in dlopen(). Which would make the whole thing just better than nothing. > > > > > It's fine to turn RDSSP into an actual emulated read of the SSP, or > at > least an emulated load of zero so that uninitialized data is not left > in the target register. We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer x86-only knowledge). > If doing the latter, code working with the > shadow stack just needs to be prepared for the possibility that it > could be async-disabled, and check the return value. > > I have not looked at all the instructions that become #UD but I > suspect they all have reasonable trivial ways to implement a > "disabled" version of them that userspace can act upon reasonably. This would have to be thought through functionally and performance wise. I'm not opposed if can come up with a fully fleshed out plan. How serious are you in pursuing musl support, if we had something like this? HJ, any thoughts on whether glibc would use this as well? It is probably worth mentioning that from the security side (as Mark mentioned there is always tension in the tradeoffs on these features), permissive mode is seen by some as something that weakens security too much. Apps could call dlopen() on a known unsupported DSO before doing ROP. I don't know if you have any musl users with specific shadow stack use cases to ask about this.
On Wed, Feb 21, 2024 at 07:22:21PM +0000, Edgecombe, Rick P wrote: > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote: > > Due to arbitrarily nestable signal frames, no, this does not suffice. > > An interrupted operation using the lock could be arbitrarily delayed, > > even never execute again, making any call to dlopen deadlock. > > Doh! Yep, it is not robust to this. The only thing that could be done > would be a timeout in dlopen(). Which would make the whole thing just > better than nothing. > > > > > > > > > > It's fine to turn RDSSP into an actual emulated read of the SSP, or > > at > > least an emulated load of zero so that uninitialized data is not left > > in the target register. > > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer > x86-only knowledge). OK, then I think the contract just has to be that userspace, in a process that might dynamically disable shadow stack, needs to do something like xor %reg,%reg before rdssp so that the outcome is deterministic in disabled case. > > If doing the latter, code working with the > > shadow stack just needs to be prepared for the possibility that it > > could be async-disabled, and check the return value. > > > > I have not looked at all the instructions that become #UD but I > > suspect they all have reasonable trivial ways to implement a > > "disabled" version of them that userspace can act upon reasonably. > > This would have to be thought through functionally and performance > wise. I'm not opposed if can come up with a fully fleshed out plan. How > serious are you in pursuing musl support, if we had something like > this? Up til this thread, my position was pretty much "nope" because it looked like it could not be done in a way compatible with existing interface requirements. However, what's been discussed here, contingent on a dynamic-disable (ideally allowing choice of per-thread or global, to minimize loss of hardening properties), Personally, I believe shadow stack has very low hardening value relative to cost/complexity, and my leaning would be just to ignore it. However, I also know it becomes marketing pressure, including pressure on distros that use musl -- "Why doesn't [distro] do shadow stack?? I thought you were security oriented!!!" -- and if it can be done in a non-breaking and non-invasive way, I think it's reasonable to pursue and make something work. > HJ, any thoughts on whether glibc would use this as well? > > It is probably worth mentioning that from the security side (as Mark > mentioned there is always tension in the tradeoffs on these features), > permissive mode is seen by some as something that weakens security too > much. Apps could call dlopen() on a known unsupported DSO before doing > ROP. I don't know if you have any musl users with specific shadow stack > use cases to ask about this. Yes, this is potentially an argument for something like the option 2, if there's a way to leave SS enabled but then trap when something goes wrong, detect if it went wrong via SS-incompatible library code, and lazily disable SS, otherwise terminate. But I just realized, I'm not even sure why shared libraries need to be explicitly SS-compatible. Unless they're doing their own asm stack switches, shouldn't they just work by default? And since I don't understand this reason, I also don't understand what the failure mode is when an incompatible library is loaded, and thus whether it would be possible to detect and attribute the failure to the library, or whether the library would induce failure somewhere else. Anyway, a mechanism to allow the userspace implementation to disable SS doesn't inherently expose a means to do that. A system integrator doing maximum hardening might choose to build all libraries as SS-compatible, or to patch the loader to refuse to load incompatible libraries. Rich
On Wed, Feb 21, 2024 at 11:22 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote: > > Due to arbitrarily nestable signal frames, no, this does not suffice. > > An interrupted operation using the lock could be arbitrarily delayed, > > even never execute again, making any call to dlopen deadlock. > > Doh! Yep, it is not robust to this. The only thing that could be done > would be a timeout in dlopen(). Which would make the whole thing just > better than nothing. > > > > > > > > > > It's fine to turn RDSSP into an actual emulated read of the SSP, or > > at > > least an emulated load of zero so that uninitialized data is not left > > in the target register. > > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer > x86-only knowledge). > > > If doing the latter, code working with the > > shadow stack just needs to be prepared for the possibility that it > > could be async-disabled, and check the return value. > > > > I have not looked at all the instructions that become #UD but I > > suspect they all have reasonable trivial ways to implement a > > "disabled" version of them that userspace can act upon reasonably. > > This would have to be thought through functionally and performance > wise. I'm not opposed if can come up with a fully fleshed out plan. How > serious are you in pursuing musl support, if we had something like > this? > > HJ, any thoughts on whether glibc would use this as well? Assuming that we are talking about permissive mode, if kernel can suppress UD, we don't need to disable SHSTK. Glibc can enable ARCH_SHSTK_SUPPRESS_UD instead. > It is probably worth mentioning that from the security side (as Mark > mentioned there is always tension in the tradeoffs on these features), > permissive mode is seen by some as something that weakens security too > much. Apps could call dlopen() on a known unsupported DSO before doing > ROP. I don't know if you have any musl users with specific shadow stack > use cases to ask about this.
On Wed, Feb 21, 2024 at 12:18 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Wed, Feb 21, 2024 at 11:22 AM Edgecombe, Rick P > <rick.p.edgecombe@intel.com> wrote: > > > > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote: > > > Due to arbitrarily nestable signal frames, no, this does not suffice. > > > An interrupted operation using the lock could be arbitrarily delayed, > > > even never execute again, making any call to dlopen deadlock. > > > > Doh! Yep, it is not robust to this. The only thing that could be done > > would be a timeout in dlopen(). Which would make the whole thing just > > better than nothing. > > > > > > > > > > > > > > > It's fine to turn RDSSP into an actual emulated read of the SSP, or > > > at > > > least an emulated load of zero so that uninitialized data is not left > > > in the target register. > > > > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer > > x86-only knowledge). > > > > > If doing the latter, code working with the > > > shadow stack just needs to be prepared for the possibility that it > > > could be async-disabled, and check the return value. > > > > > > I have not looked at all the instructions that become #UD but I > > > suspect they all have reasonable trivial ways to implement a > > > "disabled" version of them that userspace can act upon reasonably. > > > > This would have to be thought through functionally and performance > > wise. I'm not opposed if can come up with a fully fleshed out plan. How > > serious are you in pursuing musl support, if we had something like > > this? > > > > HJ, any thoughts on whether glibc would use this as well? > > Assuming that we are talking about permissive mode, if kernel can > suppress UD, we don't need to disable SHSTK. Glibc can enable > ARCH_SHSTK_SUPPRESS_UD instead. Kernel must suppress all possible SHSTK UDs. > > It is probably worth mentioning that from the security side (as Mark > > mentioned there is always tension in the tradeoffs on these features), > > permissive mode is seen by some as something that weakens security too > > much. Apps could call dlopen() on a known unsupported DSO before doing > > ROP. I don't know if you have any musl users with specific shadow stack > > use cases to ask about this. > > > > -- > H.J.
On Wed, Feb 21, 2024 at 12:25 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Wed, Feb 21, 2024 at 12:18 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Wed, Feb 21, 2024 at 11:22 AM Edgecombe, Rick P > > <rick.p.edgecombe@intel.com> wrote: > > > > > > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote: > > > > Due to arbitrarily nestable signal frames, no, this does not suffice. > > > > An interrupted operation using the lock could be arbitrarily delayed, > > > > even never execute again, making any call to dlopen deadlock. > > > > > > Doh! Yep, it is not robust to this. The only thing that could be done > > > would be a timeout in dlopen(). Which would make the whole thing just > > > better than nothing. > > > > > > > > > > > > > > > > > > > > It's fine to turn RDSSP into an actual emulated read of the SSP, or > > > > at > > > > least an emulated load of zero so that uninitialized data is not left > > > > in the target register. > > > > > > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer > > > x86-only knowledge). > > > > > > > If doing the latter, code working with the > > > > shadow stack just needs to be prepared for the possibility that it > > > > could be async-disabled, and check the return value. > > > > > > > > I have not looked at all the instructions that become #UD but I > > > > suspect they all have reasonable trivial ways to implement a > > > > "disabled" version of them that userspace can act upon reasonably. > > > > > > This would have to be thought through functionally and performance > > > wise. I'm not opposed if can come up with a fully fleshed out plan. How > > > serious are you in pursuing musl support, if we had something like > > > this? > > > > > > HJ, any thoughts on whether glibc would use this as well? > > > > Assuming that we are talking about permissive mode, if kernel can > > suppress UD, we don't need to disable SHSTK. Glibc can enable > > ARCH_SHSTK_SUPPRESS_UD instead. > > Kernel must suppress all possible SHSTK UDs. If SHSTK is disabled by kernel, not by glibc, there can be 2 issues: 1. Glibc and kernel may be out of sync on SHSTK. 2. When kernel disables SHSTK, glibc may be in the middle of reading shadow stack in longjmp, searching for a restore token. > > > It is probably worth mentioning that from the security side (as Mark > > > mentioned there is always tension in the tradeoffs on these features), > > > permissive mode is seen by some as something that weakens security too > > > much. Apps could call dlopen() on a known unsupported DSO before doing > > > ROP. I don't know if you have any musl users with specific shadow stack > > > use cases to ask about this. > > > > > > > > -- > > H.J. > > > > -- > H.J.
On Wed, Feb 21, 2024 at 07:22:21PM +0000, Edgecombe, Rick P wrote: > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote: > > It's fine to turn RDSSP into an actual emulated read of the SSP, or > > at > > least an emulated load of zero so that uninitialized data is not left > > in the target register. > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer > x86-only knowledge). For arm64 we have a separate control GCSCRE0_EL1.nTR for access to GCSPR_EL0 (our SSP equivalent) we can use. > > I have not looked at all the instructions that become #UD but I > > suspect they all have reasonable trivial ways to implement a > > "disabled" version of them that userspace can act upon reasonably. > This would have to be thought through functionally and performance > wise. I'm not opposed if can come up with a fully fleshed out plan. How > serious are you in pursuing musl support, if we had something like > this? Same here, we have to be careful since it's defining ABI in a way that we don't normally provide ABI but if there's a clear case for doing it then...
* Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]: > On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote: > > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote: > > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote: > > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > > > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > > > > > allow limited control of the SSP. When shadow stack gets disabled, > > > > > these suddenly turn into #UD generating instructions. So any other > > > > > threads executing those instructions when shadow stack got disabled > > > > > would be in for a nasty surprise. > > > > > This is the kernel's problem if that's happening. It should be > > > > trapping these and returning immediately like a NOP if shadow stack > > > > has been disabled, not generating SIGILL. > > > > I'm not sure that's going to work out well, all it takes is some code > > > that's looking at the shadow stack and expecting something to happen as > > > a result of the instructions it's executing and we run into trouble. A > > > I said NOP but there's no reason it strictly needs to be a NOP. It > > could instead do something reasonable to convey the state of racing > > with shadow stack being disabled. > > This feels like it's getting complicated and I fear it may be an uphill > struggle to get such code merged, at least for arm64. My instinct is the aarch64 behaviour is already nop for gcs instructions when gcs is disabled. the isa was designed so async disable is possible. only x86 linux would have to emulate this. > that it's going to be much more robust and generally tractable to let > things run to some suitable synchronisation point and then disable > there, but if we're going to do that then userspace can hopefully > arrange to do the disabling itself through the standard disable > interface anyway. Presumably it'll want to notice things being disabled > at some point anyway? TBH that's been how all the prior proposals for > process wide disable I've seen were done.
On Sat, Mar 2, 2024 at 6:57 AM Szabolcs Nagy <nsz@port70.net> wrote: > > * Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]: > > > On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote: > > > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote: > > > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote: > > > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote: > > > > > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that > > > > > > allow limited control of the SSP. When shadow stack gets disabled, > > > > > > these suddenly turn into #UD generating instructions. So any other > > > > > > threads executing those instructions when shadow stack got disabled > > > > > > would be in for a nasty surprise. > > > > > > > This is the kernel's problem if that's happening. It should be > > > > > trapping these and returning immediately like a NOP if shadow stack > > > > > has been disabled, not generating SIGILL. > > > > > > I'm not sure that's going to work out well, all it takes is some code > > > > that's looking at the shadow stack and expecting something to happen as > > > > a result of the instructions it's executing and we run into trouble. A > > > > > I said NOP but there's no reason it strictly needs to be a NOP. It > > > could instead do something reasonable to convey the state of racing > > > with shadow stack being disabled. > > > > This feels like it's getting complicated and I fear it may be an uphill > > struggle to get such code merged, at least for arm64. My instinct is > > the aarch64 behaviour is already nop > for gcs instructions when gcs is disabled. > the isa was designed so async disable is > possible. > > only x86 linux would have to emulate this. On Linux/x86, normal instructions are used to update SSP after checking SHSTK is enabled. If SHSTK is disabled in between, program behavior may be undefined.
On Sat, Mar 02, 2024 at 03:57:02PM +0100, Szabolcs Nagy wrote: > * Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]: > > > I said NOP but there's no reason it strictly needs to be a NOP. It > > > could instead do something reasonable to convey the state of racing > > > with shadow stack being disabled. > > This feels like it's getting complicated and I fear it may be an uphill > > struggle to get such code merged, at least for arm64. My instinct is > the aarch64 behaviour is already nop > for gcs instructions when gcs is disabled. > the isa was designed so async disable is > possible. Yeah, we'd need to handle GCSPR_EL0 somehow (currently it's inaccessible when GCS is disabled) and userspace would need to take care it's not doing something that could get stuck if for example a pop didn't actually *do* anything.
The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling. When GCS is active a secondary stack called the Guarded Control Stack is maintained, protected with a memory attribute which means that it can only be written with specific GCS operations. The current GCS pointer can not be directly written to by userspace. When a BL is executed the value stored in LR is also pushed onto the GCS, and when a RET is executed the top of the GCS is popped and compared to LR with a fault being raised if the values do not match. GCS operations may only be performed on GCS pages, a data abort is generated if they are not. The combination of hardware enforcement and lack of extra instructions in the function entry and exit paths should result in something which has less overhead and is more difficult to attack than a purely software implementation like clang's shadow stacks. This series implements support for use of GCS by userspace, along with support for use of GCS within KVM guests. It does not enable use of GCS by either EL1 or EL2, this will be implemented separately. Executables are started without GCS and must use a prctl() to enable it, it is expected that this will be done very early in application execution by the dynamic linker or other startup code. For dynamic linking this will be done by checking that everything in the executable is marked as GCS compatible. x86 has an equivalent feature called shadow stacks, this series depends on the x86 patches for generic memory management support for the new guarded/shadow stack page type and shares APIs as much as possible. As there has been extensive discussion with the wider community around the ABI for shadow stacks I have as far as practical kept implementation decisions close to those for x86, anticipating that review would lead to similar conclusions in the absence of strong reasoning for divergence. The main divergence I am concious of is that x86 allows shadow stack to be enabled and disabled repeatedly, freeing the shadow stack for the thread whenever disabled, while this implementation keeps the GCS allocated after disable but refuses to reenable it. This is to avoid races with things actively walking the GCS during a disable, we do anticipate that some systems will wish to disable GCS at runtime but are not aware of any demand for subsequently reenabling it. x86 uses an arch_prctl() to manage enable and disable, since only x86 and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a patch set for the equivalent RISC-V Zicfiss feature which I initially adopted fairly directly but following review feedback has been revised quite a bit. We currently maintain the x86 pattern of implicitly allocating a shadow stack for threads started with shadow stack enabled, there has been some discussion of removing this support and requiring the use of clone3() with explicit allocation of shadow stacks instead. I have no strong feelings either way, implicit allocation is not really consistent with anything else we do and creates the potential for errors around thread exit but on the other hand it is existing ABI on x86 and minimises the changes needed in userspace code. There is an open issue with support for CRIU, on x86 this required the ability to set the GCS mode via ptrace. This series supports configuring mode bits other than enable/disable via ptrace but it needs to be confirmed if this is sufficient. The series depends on support for shadow stacks in clone3(), that series includes the addition of ARCH_HAS_USER_SHADOW_STACK. https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org It also depends on the addition of more waitpid() flags to nolibc: https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@kernel.org You can see a branch with the full set of dependencies against Linus' tree at: https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs [1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ Signed-off-by: Mark Brown <broonie@kernel.org> --- Changes in v8: - Invalidate signal cap token on stack when consuming. - Typo and other trivial fixes. - Don't try to use process_vm_write() on GCS, it intentionally does not work. - Fix leak of thread GCSs. - Rebase onto latest clone3() series. - Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org Changes in v7: - Rebase onto v6.7-rc2 via the clone3() patch series. - Change the token used to cap the stack during signal handling to be compatible with GCSPOPM. - Fix flags for new page types. - Fold in support for clone3(). - Replace copy_to_user_gcs() with put_user_gcs(). - Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org Changes in v6: - Rebase onto v6.6-rc3. - Add some more gcsb_dsync() barriers following spec clarifications. - Due to ongoing discussion around clone()/clone3() I've not updated anything there, the behaviour is the same as on previous versions. - Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org Changes in v5: - Don't map any permissions for user GCSs, we always use EL0 accessors or use a separate mapping of the page. - Reduce the standard size of the GCS to RLIMIT_STACK/2. - Enforce a PAGE_SIZE alignment requirement on map_shadow_stack(). - Clarifications and fixes to documentation. - More tests. - Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org Changes in v4: - Implement flags for map_shadow_stack() allowing the cap and end of stack marker to be enabled independently or not at all. - Relax size and alignment requirements for map_shadow_stack(). - Add more blurb explaining the advantages of hardware enforcement. - Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org Changes in v3: - Rebase onto v6.5-rc4. - Add a GCS barrier on context switch. - Add a GCS stress test. - Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org Changes in v2: - Rebase onto v6.5-rc3. - Rework prctl() interface to allow each bit to be locked independently. - map_shadow_stack() now places the cap token based on the size requested by the caller not the actual space allocated. - Mode changes other than enable via ptrace are now supported. - Expand test coverage. - Various smaller fixes and adjustments. - Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org --- Mark Brown (38): arm64/mm: Restructure arch_validate_flags() for extensibility prctl: arch-agnostic prctl for shadow stack mman: Add map_shadow_stack() flags arm64: Document boot requirements for Guarded Control Stacks arm64/gcs: Document the ABI for Guarded Control Stacks arm64/sysreg: Add definitions for architected GCS caps arm64/gcs: Add manual encodings of GCS instructions arm64/gcs: Provide put_user_gcs() arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS) arm64/mm: Allocate PIE slots for EL0 guarded control stack mm: Define VM_SHADOW_STACK for arm64 when we support GCS arm64/mm: Map pages for guarded control stack KVM: arm64: Manage GCS registers for guests arm64/gcs: Allow GCS usage at EL0 and EL1 arm64/idreg: Add overrride for GCS arm64/hwcap: Add hwcap for GCS arm64/traps: Handle GCS exceptions arm64/mm: Handle GCS data aborts arm64/gcs: Context switch GCS state for EL0 arm64/gcs: Ensure that new threads have a GCS arm64/gcs: Implement shadow stack prctl() interface arm64/mm: Implement map_shadow_stack() arm64/signal: Set up and restore the GCS context for signal handlers arm64/signal: Expose GCS state in signal frames arm64/ptrace: Expose GCS via ptrace and core files arm64: Add Kconfig for Guarded Control Stack (GCS) kselftest/arm64: Verify the GCS hwcap kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Add very basic GCS test program kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add test coverage for GCS mode locking selftests/arm64: Add GCS signal tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Enable GCS for the FP stress tests kselftest: Provide shadow stack enable helpers for arm64 Documentation/admin-guide/kernel-parameters.txt | 6 + Documentation/arch/arm64/booting.rst | 22 + Documentation/arch/arm64/elf_hwcaps.rst | 3 + Documentation/arch/arm64/gcs.rst | 233 +++++++ Documentation/arch/arm64/index.rst | 1 + Documentation/filesystems/proc.rst | 2 +- arch/arm64/Kconfig | 20 + arch/arm64/include/asm/cpufeature.h | 6 + arch/arm64/include/asm/el2_setup.h | 17 + arch/arm64/include/asm/esr.h | 28 +- arch/arm64/include/asm/exception.h | 2 + arch/arm64/include/asm/gcs.h | 107 +++ arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/asm/kvm_arm.h | 4 +- arch/arm64/include/asm/kvm_host.h | 12 + arch/arm64/include/asm/mman.h | 23 +- arch/arm64/include/asm/pgtable-prot.h | 14 +- arch/arm64/include/asm/processor.h | 7 + arch/arm64/include/asm/sysreg.h | 20 + arch/arm64/include/asm/uaccess.h | 40 ++ arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/ptrace.h | 8 + arch/arm64/include/uapi/asm/sigcontext.h | 9 + arch/arm64/kernel/cpufeature.c | 19 + arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kernel/entry-common.c | 23 + arch/arm64/kernel/idreg-override.c | 2 + arch/arm64/kernel/process.c | 85 +++ arch/arm64/kernel/ptrace.c | 59 ++ arch/arm64/kernel/signal.c | 242 ++++++- arch/arm64/kernel/traps.c | 11 + arch/arm64/kvm/emulate-nested.c | 4 + arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 + arch/arm64/kvm/sys_regs.c | 22 + arch/arm64/mm/Makefile | 1 + arch/arm64/mm/fault.c | 79 ++- arch/arm64/mm/gcs.c | 300 +++++++++ arch/arm64/mm/mmap.c | 13 +- arch/arm64/tools/cpucaps | 1 + arch/x86/include/uapi/asm/mman.h | 3 - fs/proc/task_mmu.c | 3 + include/linux/mm.h | 16 +- include/uapi/asm-generic/mman.h | 4 + include/uapi/linux/elf.h | 1 + include/uapi/linux/prctl.h | 22 + kernel/sys.c | 30 + tools/testing/selftests/arm64/Makefile | 2 +- tools/testing/selftests/arm64/abi/hwcap.c | 19 + tools/testing/selftests/arm64/fp/assembler.h | 15 + tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 + tools/testing/selftests/arm64/fp/sve-test.S | 2 + tools/testing/selftests/arm64/fp/za-test.S | 2 + tools/testing/selftests/arm64/fp/zt-test.S | 2 + tools/testing/selftests/arm64/gcs/.gitignore | 5 + tools/testing/selftests/arm64/gcs/Makefile | 24 + tools/testing/selftests/arm64/gcs/asm-offsets.h | 0 tools/testing/selftests/arm64/gcs/basic-gcs.c | 428 ++++++++++++ tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++ .../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++ tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++ tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++ tools/testing/selftests/arm64/gcs/libc-gcs.c | 736 +++++++++++++++++++++ tools/testing/selftests/arm64/signal/.gitignore | 1 + .../testing/selftests/arm64/signal/test_signals.c | 17 +- .../testing/selftests/arm64/signal/test_signals.h | 6 + .../selftests/arm64/signal/test_signals_utils.c | 32 +- .../selftests/arm64/signal/test_signals_utils.h | 39 ++ .../arm64/signal/testcases/gcs_exception_fault.c | 62 ++ .../selftests/arm64/signal/testcases/gcs_frame.c | 88 +++ .../arm64/signal/testcases/gcs_write_fault.c | 67 ++ .../selftests/arm64/signal/testcases/testcases.c | 7 + .../selftests/arm64/signal/testcases/testcases.h | 1 + tools/testing/selftests/ksft_shstk.h | 37 ++ 73 files changed, 4241 insertions(+), 40 deletions(-) --- base-commit: 50abefbf1bc07f5c4e403fd28f71dcee855100f7 change-id: 20230303-arm64-gcs-e311ab0d8729 Best regards,