[RFT,v4,0/5] fork: Support shadow stacks in clone3()

Message ID	20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@kernel.org (mailing list archive)
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 15C1D3D0AB; Tue, 28 Nov 2023 18:31:35 +0000 (UTC) From: Mark Brown <broonie@kernel.org> Subject: [PATCH RFT v4 0/5] fork: Support shadow stacks in clone3() Date: Tue, 28 Nov 2023 18:22:38 +0000 Message-Id: <20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit To: "Rick P. Edgecombe" <rick.p.edgecombe@intel.com>, Deepak Gupta <debug@rivosinc.com>, Szabolcs Nagy <Szabolcs.Nagy@arm.com>, "H.J. Lu" <hjl.tools@gmail.com>, Florian Weimer <fweimer@redhat.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, Christian Brauner <brauner@kernel.org>, Shuah Khan <shuah@kernel.org> Cc: linux-kernel@vger.kernel.org, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Kees Cook <keescook@chromium.org>, jannh@google.com, bsegall@google.com, linux-kselftest@vger.kernel.org, linux-api@vger.kernel.org, Mark Brown <broonie@kernel.org>, David Hildenbrand <david@redhat.com>
Series	fork: Support shadow stacks in clone3() \| expand [RFT,v4,0/5] fork: Support shadow stacks in clone3() [RFT,v4,1/5] mm: Introduce ARCH_HAS_USER_SHADOW_STACK [RFT,v4,2/5] fork: Add shadow stack support to clone3() [RFT,v4,3/5] selftests/clone3: Factor more of main loop into test_clone3() [RFT,v4,4/5] selftests/clone3: Allow tests to flag if -E2BIG is a valid error code [RFT,v4,5/5] kselftest/clone3: Test shadow stack support

Mark Brown Nov. 28, 2023, 6:22 p.m. UTC

The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1].  With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses.  This provides some
protection against ROP attacks and making it easier to collect call
stacks.  These shadow stacks are allocated in the address space of the
userspace process.

Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled.  The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread.  This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces.  As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone().  Unlike normal stacks only the shadow stack
size is specified, similar issues to those that lead to the creation of
map_shadow_stack() apply.

Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET avaible to me, I have
done testing with an integration into my pending work for GCS.  There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks, this is
still under discussion.

A new architecture feature Kconfig option for shadow stacks is added as
here, this was suggested as part of the review comments for the arm64
GCS series and since we need to detect if shadow stacks are supported it
seemed sensible to roll it in here.

[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
  validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
  CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
  desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@kernel.org

---
Mark Brown (5):
      mm: Introduce ARCH_HAS_USER_SHADOW_STACK
      fork: Add shadow stack support to clone3()
      selftests/clone3: Factor more of main loop into test_clone3()
      selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
      kselftest/clone3: Test shadow stack support

 arch/x86/Kconfig                                  |   1 +
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           |  56 ++++--
 fs/proc/task_mmu.c                                |   2 +-
 include/linux/mm.h                                |   2 +-
 include/linux/sched/task.h                        |   1 +
 include/uapi/linux/sched.h                        |   4 +
 kernel/fork.c                                     |  53 ++++--
 mm/Kconfig                                        |   6 +
 tools/testing/selftests/clone3/clone3.c           | 200 +++++++++++++++++-----
 tools/testing/selftests/clone3/clone3_selftests.h |   7 +
 12 files changed, 268 insertions(+), 77 deletions(-)
---
base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,

Catalin Marinas Nov. 30, 2023, 7 p.m. UTC | #1

Hi Mark,

Thanks for putting this together and sorry it took me some time to catch
up (well, still not fully, so rather more questions below).

On Tue, Nov 28, 2023 at 06:22:38PM +0000, Mark Brown wrote:
> Since clone3() is readily extensible let's add support for specifying a
> shadow stack when creating a new thread or process in a similar manner
> to how the normal stack is specified, keeping the current implicit
> allocation behaviour if one is not specified either with clone3() or
> through the use of clone().  Unlike normal stacks only the shadow stack
> size is specified, similar issues to those that lead to the creation of
> map_shadow_stack() apply.

My hope when looking at the arm64 patches was that we can completely
avoid the kernel allocation/deallocation of the shadow stack since it
doesn't need to do this for the normal stack either. Could someone
please summarise why we dropped the shadow stack pointer after v1? IIUC
there was a potential security argument but I don't think it was a very
strong one. Also what's the threat model for this feature? I thought
it's mainly mitigating stack corruption. If some rogue code can do
syscalls, we have bigger problems than clone3() taking a shadow stack
pointer.

My (probably wrong) mental model was that libc can do an mmap() for
normal stack, a map_shadow_stack() for the shadow one and invoke
clone3() with both these pointers and sizes. There is an overhead of an
additional syscall but if some high-performance app needs to spawn
threads quickly, it would most likely do some pooling.

I'm not against clone3() getting a shadow_stack_size argument but asking
some more questions. If we won't pass a pointer as well, is there any
advantage in expanding this syscall vs a specific prctl() option? Do we
need a different size per thread or do all threads have the same shadow
stack size? A new RLIMIT doesn't seem to map well though, it is more
like an upper limit rather than a fixed/default size (glibc I think uses
it for thread stacks but bionic or musl don't AFAIK).

Another dumb question on arm64 - is GCSPR_EL0 writeable by the user? If
yes, can the libc wrapper for threads allocate a shadow stack via
map_shadow_stack() and set it up in the thread initialisation handler
before invoking the thread function?

Thanks.

Mark Brown Nov. 30, 2023, 9:51 p.m. UTC | #2

On Thu, Nov 30, 2023 at 07:00:58PM +0000, Catalin Marinas wrote:

> My hope when looking at the arm64 patches was that we can completely
> avoid the kernel allocation/deallocation of the shadow stack since it
> doesn't need to do this for the normal stack either. Could someone
> please summarise why we dropped the shadow stack pointer after v1? IIUC
> there was a potential security argument but I don't think it was a very
> strong one. Also what's the threat model for this feature? I thought
> it's mainly mitigating stack corruption. If some rogue code can do
> syscalls, we have bigger problems than clone3() taking a shadow stack
> pointer.

As well as preventing/detecting corruption of the in memory stack shadow
stacks are also ensuring that any return instructions are unwinding a
prior call instruction, and that the returns are done in opposite order
to the calls.  This forces usage of the stack - any value we attempt to
RET to is going to be checked against the top of the shadow stack which
makes chaining returns together as a substitute for branches harder.

The concern Rick raised was that allowing user to pick the exact shadow
stack pointer would allow userspace to corrupt or reuse the stack of an
existing thread by starting a new thread with the shadow stack pointing
into the existing shadow stack of that thread.  While in isolation
that's not too much more than what userspace could just do directly
anyway it might compose with other issues to something more "interesting"
(eg, I'd be a bit concerned about overlap with pkeys/POE though I've not
thought through potential uses in detail).

> I'm not against clone3() getting a shadow_stack_size argument but asking
> some more questions. If we won't pass a pointer as well, is there any
> advantage in expanding this syscall vs a specific prctl() option? Do we
> need a different size per thread or do all threads have the same shadow
> stack size? A new RLIMIT doesn't seem to map well though, it is more
> like an upper limit rather than a fixed/default size (glibc I think uses
> it for thread stacks but bionic or musl don't AFAIK).

I don't know what the userspace patterns are likely to be here, it's
possible a single value for each process might be fine but I couldn't
say that confidently.  I agree that a RLIMIT does seem like a poor fit.

As well as the actual configuration of the size the other thing that we
gain is that as well as relying on heuristics to determine if we need to
allocate a new shadow stack for the new thread we allow userspace to
explicitly request a new shadow stack.  There was some corner case with
IIRC posix_nspawn() mentioned where the heuristics aren't what we want
for example.

> Another dumb question on arm64 - is GCSPR_EL0 writeable by the user? If
> yes, can the libc wrapper for threads allocate a shadow stack via
> map_shadow_stack() and set it up in the thread initialisation handler
> before invoking the thread function?

No, GCSPR_EL0 can only be changed by EL0 through BL, RET and the
new GCS instructions (push/pop and stack switch).  Push is optional -
userspace has to explicitly request that it be enabled and this could be
prevented through seccomp or some other LSM.  The stack switch
instructions require a token at the destination address which must
either be written by a higher EL or will be written in the process of
switching away from a stack so you can switch back.  Unless I've missed
one every mechanism for userspace to update GCSPR_EL0 will do a GCS
memory access so providing guard pages have been allocated wrapping to a
different stack will be prevented.

We would need a syscall to allow GCSPR_EL0 to be written.

Edgecombe, Rick P Nov. 30, 2023, 11:37 p.m. UTC | #3

On Thu, 2023-11-30 at 21:51 +0000, Mark Brown wrote:
> On Thu, Nov 30, 2023 at 07:00:58PM +0000, Catalin Marinas wrote:
> 
> > My hope when looking at the arm64 patches was that we can
> > completely
> > avoid the kernel allocation/deallocation of the shadow stack since
> > it
> > doesn't need to do this for the normal stack either. Could someone
> > please summarise why we dropped the shadow stack pointer after v1?
> > IIUC
> > there was a potential security argument but I don't think it was a
> > very
> > strong one. Also what's the threat model for this feature? I
> > thought
> > it's mainly mitigating stack corruption. If some rogue code can do
> > syscalls, we have bigger problems than clone3() taking a shadow
> > stack
> > pointer.
> 
> As well as preventing/detecting corruption of the in memory stack
> shadow
> stacks are also ensuring that any return instructions are unwinding a
> prior call instruction, and that the returns are done in opposite
> order
> to the calls.  This forces usage of the stack - any value we attempt
> to
> RET to is going to be checked against the top of the shadow stack
> which
> makes chaining returns together as a substitute for branches harder.
> 
> The concern Rick raised was that allowing user to pick the exact
> shadow
> stack pointer would allow userspace to corrupt or reuse the stack of
> an
> existing thread by starting a new thread with the shadow stack
> pointing
> into the existing shadow stack of that thread.  While in isolation
> that's not too much more than what userspace could just do directly
> anyway it might compose with other issues to something more
> "interesting"
> (eg, I'd be a bit concerned about overlap with pkeys/POE though I've
> not
> thought through potential uses in detail).

I think it is open for userspace customization. The kernel tries to
leave the option to lock things down as much as it can (partly because
it's not clear how all the userspace tradeoffs will shake out).

In the past, we had talked about allowing a set SSP (GCSPR) prctl() to
help with some of the compatibility gaps (longjmp() between stacks,
etc). If we loosened things up a bit this could help there, but it kind
of defeats the purpose a little, of the token checking stuff built into
these features at the HW level. A super-stack-canary mode might be nice
for people who just want to flip a switch on existing apps without
checking them, or people who want to do tracing and don't care about
security. But, I also wouldn't be surprised if some high security
applications decide to block map_shadow_stack all together to lock
threads to their own shadow stacks.

So I kind of like leaning towards leaving the option to lock things
down more when we can. Like Mark was getting at, we don't know all the
ways shadow stacks will get attacked yet. So turning it around, why not
let the shadow stack get allocated by the kernel? It makes the kernel
code/complexity smaller, are there any other benefits?

> 
> > I'm not against clone3() getting a shadow_stack_size argument but
> > asking
> > some more questions. If we won't pass a pointer as well, is there
> > any
> > advantage in expanding this syscall vs a specific prctl() option?
> > Do we
> > need a different size per thread or do all threads have the same
> > shadow
> > stack size? A new RLIMIT doesn't seem to map well though, it is
> > more
> > like an upper limit rather than a fixed/default size (glibc I think
> > uses
> > it for thread stacks but bionic or musl don't AFAIK).
> 
> I don't know what the userspace patterns are likely to be here, it's
> possible a single value for each process might be fine but I couldn't
> say that confidently.  I agree that a RLIMIT does seem like a poor
> fit.
> 
> As well as the actual configuration of the size the other thing that
> we
> gain is that as well as relying on heuristics to determine if we need
> to
> allocate a new shadow stack for the new thread we allow userspace to
> explicitly request a new shadow stack.  There was some corner case
> with
> IIRC posix_nspawn() mentioned where the heuristics aren't what we
> want
> for example.

Can't posix_spawn() pass in a shadow stack size into clone3 to get a
new shadow stack after this series?

> 
> > Another dumb question on arm64 - is GCSPR_EL0 writeable by the
> > user? If
> > yes, can the libc wrapper for threads allocate a shadow stack via
> > map_shadow_stack() and set it up in the thread initialisation
> > handler
> > before invoking the thread function?
> 
> No, GCSPR_EL0 can only be changed by EL0 through BL, RET and the
> new GCS instructions (push/pop and stack switch).  Push is optional -
> userspace has to explicitly request that it be enabled and this could
> be
> prevented through seccomp or some other LSM.  The stack switch
> instructions require a token at the destination address which must
> either be written by a higher EL or will be written in the process of
> switching away from a stack so you can switch back.  Unless I've
> missed
> one every mechanism for userspace to update GCSPR_EL0 will do a GCS
> memory access so providing guard pages have been allocated wrapping
> to a
> different stack will be prevented.
> 
> We would need a syscall to allow GCSPR_EL0 to be written.

I think the problem with doing this is signals. If a signal is
delivered to the new thread, then it could push to the old shadow stack
before userspace gets a chance to switch. So the thread needs to start
on a new shadow/stack.

Szabolcs Nagy Dec. 1, 2023, 11:50 a.m. UTC | #4

The 11/30/2023 21:51, Mark Brown wrote:
> The concern Rick raised was that allowing user to pick the exact shadow
> stack pointer would allow userspace to corrupt or reuse the stack of an
> existing thread by starting a new thread with the shadow stack pointing
> into the existing shadow stack of that thread.  While in isolation

note that this can be prevented by map_shadow_stack adding
a token that clone3 verifies.

> that's not too much more than what userspace could just do directly
> anyway it might compose with other issues to something more "interesting"
> (eg, I'd be a bit concerned about overlap with pkeys/POE though I've not
> thought through potential uses in detail).
>
> > I'm not against clone3() getting a shadow_stack_size argument but asking
> > some more questions. If we won't pass a pointer as well, is there any
> > advantage in expanding this syscall vs a specific prctl() option? Do we
> > need a different size per thread or do all threads have the same shadow
> > stack size? A new RLIMIT doesn't seem to map well though, it is more
> > like an upper limit rather than a fixed/default size (glibc I think uses
> > it for thread stacks but bionic or musl don't AFAIK).
>
> I don't know what the userspace patterns are likely to be here, it's
> possible a single value for each process might be fine but I couldn't
> say that confidently.  I agree that a RLIMIT does seem like a poor fit.

user code can control the thread stack size per thread
and different size per thread happens in practice (even
in the libc e.g. timer_create with SIGEV_THREAD uses
different stack size than the dns resolver helper thread).
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Mark Brown Dec. 1, 2023, 1:47 p.m. UTC | #5

On Fri, Dec 01, 2023 at 11:50:25AM +0000, Szabolcs Nagy wrote:
> The 11/30/2023 21:51, Mark Brown wrote:
> > The concern Rick raised was that allowing user to pick the exact shadow
> > stack pointer would allow userspace to corrupt or reuse the stack of an
> > existing thread by starting a new thread with the shadow stack pointing
> > into the existing shadow stack of that thread.  While in isolation

> note that this can be prevented by map_shadow_stack adding
> a token that clone3 verifies.

That would make it impossible to reuse the shadow stack once the token
is overwritten which does move the needle more towards making doing the
mapping separately pure overhead.

Mark Brown Dec. 1, 2023, 2 p.m. UTC | #6

On Thu, Nov 30, 2023 at 11:37:42PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-11-30 at 21:51 +0000, Mark Brown wrote:
> > On Thu, Nov 30, 2023 at 07:00:58PM +0000, Catalin Marinas wrote:

> > explicitly request a new shadow stack.  There was some corner case
> > with
> > IIRC posix_nspawn() mentioned where the heuristics aren't what we
> > want
> > for example.

> Can't posix_spawn() pass in a shadow stack size into clone3 to get a
> new shadow stack after this series?

Yes, the above was addressing Catalin's suggestion that we add stack
size control separately to clone3() instead - doing that would remove
the ability to explicitly request a new stack unless we add a flag to
clone3() at which point we're back to modifying clone3() anyway.

> > > Another dumb question on arm64 - is GCSPR_EL0 writeable by the
> > > user? If
> > > yes, can the libc wrapper for threads allocate a shadow stack via
> > > map_shadow_stack() and set it up in the thread initialisation
> > > handler
> > > before invoking the thread function?

> > We would need a syscall to allow GCSPR_EL0 to be written.

> I think the problem with doing this is signals. If a signal is
> delivered to the new thread, then it could push to the old shadow stack
> before userspace gets a chance to switch. So the thread needs to start
> on a new shadow/stack.

That's an issue, plus using a syscall just wouldn't work with a security
model that locked down writes to the pointer which does seem like
something people would reasonably want to deploy.

Catalin Marinas Dec. 1, 2023, 5:30 p.m. UTC | #7

Thanks all for the clarification.

On Thu, Nov 30, 2023 at 09:51:04PM +0000, Mark Brown wrote:
> On Thu, Nov 30, 2023 at 07:00:58PM +0000, Catalin Marinas wrote:
> > My hope when looking at the arm64 patches was that we can completely
> > avoid the kernel allocation/deallocation of the shadow stack since it
> > doesn't need to do this for the normal stack either. Could someone
> > please summarise why we dropped the shadow stack pointer after v1? IIUC
> > there was a potential security argument but I don't think it was a very
> > strong one. Also what's the threat model for this feature? I thought
> > it's mainly mitigating stack corruption. If some rogue code can do
> > syscalls, we have bigger problems than clone3() taking a shadow stack
> > pointer.
> 
> As well as preventing/detecting corruption of the in memory stack shadow
> stacks are also ensuring that any return instructions are unwinding a
> prior call instruction, and that the returns are done in opposite order
> to the calls.  This forces usage of the stack - any value we attempt to
> RET to is going to be checked against the top of the shadow stack which
> makes chaining returns together as a substitute for branches harder.
> 
> The concern Rick raised was that allowing user to pick the exact shadow
> stack pointer would allow userspace to corrupt or reuse the stack of an
> existing thread by starting a new thread with the shadow stack pointing
> into the existing shadow stack of that thread.  While in isolation
> that's not too much more than what userspace could just do directly
> anyway it might compose with other issues to something more "interesting"
> (eg, I'd be a bit concerned about overlap with pkeys/POE though I've not
> thought through potential uses in detail).

Another concern I had was that map_shadow_stack() currently takes
a flags arg (though only one flag) while the clone/clone3() allocate the
shadow stack with an implicit configuration (other than size). Would
map_shadow_stack() ever get new flags that we may also need to set on
the default thread shadow stack (e.g. a new permission type)? At that
point it would be better if clone3() allowed a shadow stack pointer so
that any specific attributes would be limited to map_shadow_stack().

If that's only theoretical, I'm fine to go ahead with a size-only
argument for clone3(). We could also add the pointer now and allocate
the stack if NULL or reuse it if not, maybe with some prctl to allow
this. It might be overengineering and we'd never use such feature
though.

> > I'm not against clone3() getting a shadow_stack_size argument but asking
> > some more questions. If we won't pass a pointer as well, is there any
> > advantage in expanding this syscall vs a specific prctl() option? Do we
> > need a different size per thread or do all threads have the same shadow
> > stack size? A new RLIMIT doesn't seem to map well though, it is more
> > like an upper limit rather than a fixed/default size (glibc I think uses
> > it for thread stacks but bionic or musl don't AFAIK).
> 
> I don't know what the userspace patterns are likely to be here, it's
> possible a single value for each process might be fine but I couldn't
> say that confidently.  I agree that a RLIMIT does seem like a poor fit.

Szabolcs clarified that there are cases where we need the size per
thread.

> As well as the actual configuration of the size the other thing that we
> gain is that as well as relying on heuristics to determine if we need to
> allocate a new shadow stack for the new thread we allow userspace to
> explicitly request a new shadow stack.

But the reverse is not true - we can't use clone3() to create a thread
without a shadow stack AFAICT.

> > Another dumb question on arm64 - is GCSPR_EL0 writeable by the user? If
> > yes, can the libc wrapper for threads allocate a shadow stack via
> > map_shadow_stack() and set it up in the thread initialisation handler
> > before invoking the thread function?
> 
> No, GCSPR_EL0 can only be changed by EL0 through BL, RET and the
> new GCS instructions (push/pop and stack switch).  Push is optional -
> userspace has to explicitly request that it be enabled and this could be
> prevented through seccomp or some other LSM.  The stack switch
> instructions require a token at the destination address which must
> either be written by a higher EL or will be written in the process of
> switching away from a stack so you can switch back.  Unless I've missed
> one every mechanism for userspace to update GCSPR_EL0 will do a GCS
> memory access so providing guard pages have been allocated wrapping to a
> different stack will be prevented.
> 
> We would need a syscall to allow GCSPR_EL0 to be written.

Good point, I thought I must be missing something.

Mark Brown Dec. 1, 2023, 6:28 p.m. UTC | #8

On Fri, Dec 01, 2023 at 05:30:22PM +0000, Catalin Marinas wrote:

> Another concern I had was that map_shadow_stack() currently takes
> a flags arg (though only one flag) while the clone/clone3() allocate the
> shadow stack with an implicit configuration (other than size). Would
> map_shadow_stack() ever get new flags that we may also need to set on
> the default thread shadow stack (e.g. a new permission type)? At that
> point it would be better if clone3() allowed a shadow stack pointer so
> that any specific attributes would be limited to map_shadow_stack().

The flags argument currently only lets you specify if a stack switch
token should be written (which is not relevant for the clone3() case)
and if a top of stack marker should be included (which since the top of
stack marker is NULL for arm64 only has perceptible effect if a token is
being written).  I'm not particularly anticipating any further additions,
though never say never.

> If that's only theoretical, I'm fine to go ahead with a size-only
> argument for clone3(). We could also add the pointer now and allocate
> the stack if NULL or reuse it if not, maybe with some prctl to allow
> this. It might be overengineering and we'd never use such feature
> though.

Yeah, it seems like a bunch of work and interface to test that I'm not
convinced anyone would actually use.

> > As well as the actual configuration of the size the other thing that we
> > gain is that as well as relying on heuristics to determine if we need to
> > allocate a new shadow stack for the new thread we allow userspace to
> > explicitly request a new shadow stack.

> But the reverse is not true - we can't use clone3() to create a thread
> without a shadow stack AFAICT.

Right.  Given the existing implicit allocation only x86 ABI we'd need to
retrofit that by adding an explicit "no shadow stack" flag.  That is
possible though I'm having a hard time seeing the use case for it.

Robert O'Callahan Dec. 9, 2023, 12:59 a.m. UTC | #9

On Wed, 29 Nov 2023 at 07:31, Mark Brown <broonie@kernel.org> wrote:
> Since clone3() is readily extensible let's add support for specifying a
> shadow stack when creating a new thread or process in a similar manner
> to how the normal stack is specified, keeping the current implicit
> allocation behaviour if one is not specified either with clone3() or
> through the use of clone().  Unlike normal stacks only the shadow stack
> size is specified, similar issues to those that lead to the creation of
> map_shadow_stack() apply.

rr (https://rr-project.org) records program execution and then reruns
it with exactly the same behavior (down to memory contents and
register values). To replay clone() etc in an application using shadow
stacks, we'll need to be able to ensure the shadow stack is mapped at
the same address during the replay run as during the recording run. We
ptrace the replay tasks and have the ability to execute arbitrary
syscalls in them. It sounds like we might be able to make this work by
overriding clone_args::shadow_stack_size to zero in the call to
clone3(), instead having the replay task call map_shadow_stack() to
put the the shadow stack in the right place, and then setting its SSP
via ptrace. Will that work?

Thanks,
Rob

Mark Brown Dec. 9, 2023, 1:06 a.m. UTC | #10

On Sat, Dec 09, 2023 at 01:59:16PM +1300, Robert O'Callahan wrote:

> overriding clone_args::shadow_stack_size to zero in the call to
> clone3(), instead having the replay task call map_shadow_stack() to
> put the the shadow stack in the right place, and then setting its SSP
> via ptrace. Will that work?

That should work with the interface in the current series, yes.

[RFT,v4,0/5] fork: Support shadow stacks in clone3()

Message

Comments