diff mbox series

[v27,24/31] x86/cet/shstk: Handle thread shadow stack

Message ID 20210521221211.29077-25-yu-cheng.yu@intel.com (mailing list archive)
State New, archived
Headers show
Series Control-flow Enforcement: Shadow Stack | expand

Commit Message

Yu-cheng Yu May 21, 2021, 10:12 p.m. UTC
For clone() with CLONE_VM, except vfork, the child and the parent must have
separate shadow stacks.  Thus, the kernel allocates, and frees on thread
exit a new shadow stack for the child.

Use stack_size passed from clone3() syscall for thread shadow stack size.
A compat-mode thread shadow stack size is further reduced to 1/4.  This
allows more threads to run in a 32-bit address space.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h         |  5 +++
 arch/x86/include/asm/mmu_context.h |  3 ++
 arch/x86/kernel/process.c          | 15 +++++---
 arch/x86/kernel/shstk.c            | 55 +++++++++++++++++++++++++++++-
 4 files changed, 73 insertions(+), 5 deletions(-)

Comments

Andy Lutomirski May 22, 2021, 11:39 p.m. UTC | #1
On Fri, May 21, 2021 at 3:14 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 5ea2b494e9f9..8e5f772181b9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -71,6 +71,53 @@ int shstk_setup(void)
>         return 0;
>  }
>
> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> +                            unsigned long stack_size)
> +{

...

> +       state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> +       if (!state)
> +               return -EINVAL;
> +

The get_xsave_addr() API is horrible, and we already have one
egregiously buggy instance in the kernel.  Let's not add another
dubious use case.

If state == NULL, this means that CET_USER is in its init state.
CET_USER in the init state should behave identically regardless of
whether XINUSE[CET_USER] is set.  Can you please adjust this code
accordingly?

Thanks,
Andy
Yu-cheng Yu May 25, 2021, 3:04 p.m. UTC | #2
On 5/22/2021 4:39 PM, Andy Lutomirski wrote:
> On Fri, May 21, 2021 at 3:14 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
>> index 5ea2b494e9f9..8e5f772181b9 100644
>> --- a/arch/x86/kernel/shstk.c
>> +++ b/arch/x86/kernel/shstk.c
>> @@ -71,6 +71,53 @@ int shstk_setup(void)
>>          return 0;
>>   }
>>
>> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
>> +                            unsigned long stack_size)
>> +{
> 
> ...
> 
>> +       state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
>> +       if (!state)
>> +               return -EINVAL;
>> +
> 
> The get_xsave_addr() API is horrible, and we already have one
> egregiously buggy instance in the kernel.  Let's not add another
> dubious use case.
> 
> If state == NULL, this means that CET_USER is in its init state.
> CET_USER in the init state should behave identically regardless of
> whether XINUSE[CET_USER] is set.  Can you please adjust this code
> accordingly?
> 

I will work on that.

Thanks,
Yu-cheng
John Allen July 21, 2021, 6:14 p.m. UTC | #3
On Fri, May 21, 2021 at 03:12:04PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 5ea2b494e9f9..8e5f772181b9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -71,6 +71,53 @@ int shstk_setup(void)
>  	return 0;
>  }
>  
> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> +			     unsigned long stack_size)
> +{
> +	struct thread_shstk *shstk = &tsk->thread.shstk;
> +	struct cet_user_state *state;
> +	unsigned long addr;
> +
> +	if (!stack_size)
> +		return -EINVAL;

I've been doing some light testing on AMD hardware and I've found that
this version of the patchset doesn't boot for me. It appears that when
systemd processes start spawning, they hit the above case, return
-EINVAL, and the fork fails. In these cases, copy_thread has been passed
0 for both sp and stack_size.

For previous versions of the patchset, I can still boot. When the
stack_size check was last, the function would always return before
completing the check, hitting one of the two cases below.

At the very least, it would seem that on some systems, it isn't valid to
rely on the stack_size passed from clone3, though I'm unsure what the
correct behavior should be here. If the passed stack_size == 0 and sp ==
0, is this a case where we want to alloc a shadow stack for this thread
with some capped size? Alternatively, is this a case that isn't valid to
alloc a shadow stack and we should simply return 0 instead of -EINVAL?

I'm running Fedora 34 which satisfies the required versions of gcc,
binutils, and glibc.

Please let me know if there is any additional information I can provide.

Thanks,
John

> +
> +	if (!shstk->size)
> +		return 0;
> +
> +	/*
> +	 * For CLONE_VM, except vfork, the child needs a separate shadow
> +	 * stack.
> +	 */
> +	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> +		return 0;
> +
> +	state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> +	if (!state)
> +		return -EINVAL;
> +
> +	/*
> +	 * Compat-mode pthreads share a limited address space.
> +	 * If each function call takes an average of four slots
> +	 * stack space, allocate 1/4 of stack size for shadow stack.
> +	 */
> +	if (in_compat_syscall())
> +		stack_size /= 4;
> +
> +	stack_size = round_up(stack_size, PAGE_SIZE);
> +	addr = alloc_shstk(stack_size);
> +	if (IS_ERR_VALUE(addr)) {
> +		shstk->base = 0;
> +		shstk->size = 0;
> +		return PTR_ERR((void *)addr);
> +	}
> +
> +	fpu__prepare_write(&tsk->thread.fpu);
> +	state->user_ssp = (u64)(addr + stack_size);
> +	shstk->base = addr;
> +	shstk->size = stack_size;
> +	return 0;
> +}
> +
>  void shstk_free(struct task_struct *tsk)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> @@ -80,7 +127,13 @@ void shstk_free(struct task_struct *tsk)
>  	    !shstk->base)
>  		return;
>  
> -	if (!tsk->mm)
> +	/*
> +	 * When fork() with CLONE_VM fails, the child (tsk) already has a
> +	 * shadow stack allocated, and exit_thread() calls this function to
> +	 * free it.  In this case the parent (current) and the child share
> +	 * the same mm struct.
> +	 */
> +	if (!tsk->mm || tsk->mm != current->mm)
>  		return;
>  
>  	while (1) {
Florian Weimer July 21, 2021, 6:28 p.m. UTC | #4
* John Allen:

> At the very least, it would seem that on some systems, it isn't valid to
> rely on the stack_size passed from clone3, though I'm unsure what the
> correct behavior should be here. If the passed stack_size == 0 and sp ==
> 0, is this a case where we want to alloc a shadow stack for this thread
> with some capped size? Alternatively, is this a case that isn't valid to
> alloc a shadow stack and we should simply return 0 instead of -EINVAL?
>
> I'm running Fedora 34 which satisfies the required versions of gcc,
> binutils, and glibc.

Fedora 34 doesn't use clone3 yet.  You can upgrade to a rawhide build,
e.g. glibc-2.33.9000-46.fc35:

  <https://koji.fedoraproject.org/koji/buildinfo?buildID=1782678>

It's currently not in main rawhide because the Firefox sandbox breaks
clone3.  The “fix” is that clone3 will fail with ENOSYS under the
sandbox.

I expect that container runtimes turn clone3 into clone in the same way
(via ENOSYS), at least for the medium term.  So it would make sense to
allocate some sort of shadow stack for clone as well, if that's possible
to implement in some way.

Thanks,
Florian
Yu-cheng Yu July 21, 2021, 6:34 p.m. UTC | #5
On 7/21/2021 11:28 AM, Florian Weimer wrote:
> * John Allen:
> 
>> At the very least, it would seem that on some systems, it isn't valid to
>> rely on the stack_size passed from clone3, though I'm unsure what the
>> correct behavior should be here. If the passed stack_size == 0 and sp ==
>> 0, is this a case where we want to alloc a shadow stack for this thread
>> with some capped size? Alternatively, is this a case that isn't valid to
>> alloc a shadow stack and we should simply return 0 instead of -EINVAL?
>>
>> I'm running Fedora 34 which satisfies the required versions of gcc,
>> binutils, and glibc.
> 
> Fedora 34 doesn't use clone3 yet.  You can upgrade to a rawhide build,
> e.g. glibc-2.33.9000-46.fc35:
> 
>    <https://koji.fedoraproject.org/koji/buildinfo?buildID=1782678>
> 
> It's currently not in main rawhide because the Firefox sandbox breaks
> clone3.  The “fix” is that clone3 will fail with ENOSYS under the
> sandbox.
> 
> I expect that container runtimes turn clone3 into clone in the same way
> (via ENOSYS), at least for the medium term.  So it would make sense to
> allocate some sort of shadow stack for clone as well, if that's possible
> to implement in some way.
> 
> Thanks,
> Florian
> 

Thanks Florian!  And because of that reason, we will put back clone2 
support in my next v28 patches.

Yu-cheng
Dave Hansen July 21, 2021, 6:37 p.m. UTC | #6
On 7/21/21 11:14 AM, John Allen wrote:
>> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
>> +			     unsigned long stack_size)
>> +{
>> +	struct thread_shstk *shstk = &tsk->thread.shstk;
>> +	struct cet_user_state *state;
>> +	unsigned long addr;
>> +
>> +	if (!stack_size)
>> +		return -EINVAL;
> I've been doing some light testing on AMD hardware and I've found that
> this version of the patchset doesn't boot for me. It appears that when
> systemd processes start spawning, they hit the above case, return
> -EINVAL, and the fork fails. In these cases, copy_thread has been passed
> 0 for both sp and stack_size.

A few tangential things I noticed:

This hunk is not mentioned in the version changelog at all.  I also
don't see any feedback that might have prompted it.  This is one reason
per-patch changelogs are preferred.

As a general rule, new features should strive to be implemented in a way
that it's *obvious* that they won't break old code.
shstk_alloc_thread_stack() fails that test for me.  If it had:

	if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) // or whatever
		return 0;

in the function, it would be obviously harmless.  Better yet would be
doing the feature check at the shstk_alloc_thread_stack() call site,
that way even the function call can be optimized out.

Further, this confused me because the changelog didn't even mention the
arg -> stack_size rename.  That would have been nice for another patch,
or an extra sentence in the changelog.
H.J. Lu July 21, 2021, 8:14 p.m. UTC | #7
On Wed, Jul 21, 2021 at 11:15 AM John Allen <john.allen@amd.com> wrote:
>
> On Fri, May 21, 2021 at 03:12:04PM -0700, Yu-cheng Yu wrote:
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 5ea2b494e9f9..8e5f772181b9 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -71,6 +71,53 @@ int shstk_setup(void)
> >       return 0;
> >  }
> >
> > +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> > +                          unsigned long stack_size)
> > +{
> > +     struct thread_shstk *shstk = &tsk->thread.shstk;
> > +     struct cet_user_state *state;
> > +     unsigned long addr;
> > +
> > +     if (!stack_size)
> > +             return -EINVAL;
>
> I've been doing some light testing on AMD hardware and I've found that
> this version of the patchset doesn't boot for me. It appears that when
> systemd processes start spawning, they hit the above case, return
> -EINVAL, and the fork fails. In these cases, copy_thread has been passed
> 0 for both sp and stack_size.
>
> For previous versions of the patchset, I can still boot. When the
> stack_size check was last, the function would always return before
> completing the check, hitting one of the two cases below.
>
> At the very least, it would seem that on some systems, it isn't valid to
> rely on the stack_size passed from clone3, though I'm unsure what the
> correct behavior should be here. If the passed stack_size == 0 and sp ==
> 0, is this a case where we want to alloc a shadow stack for this thread
> with some capped size? Alternatively, is this a case that isn't valid to
> alloc a shadow stack and we should simply return 0 instead of -EINVAL?
>
> I'm running Fedora 34 which satisfies the required versions of gcc,
> binutils, and glibc.
>
> Please let me know if there is any additional information I can provide.

FWIW, I have been maintaining stable CET kernels at:

https://github.com/hjl-tools/linux/

The current CET kernel is on hjl/cet/linux-5.13.y branch.

> Thanks,
> John
>
> > +
> > +     if (!shstk->size)
> > +             return 0;
> > +
> > +     /*
> > +      * For CLONE_VM, except vfork, the child needs a separate shadow
> > +      * stack.
> > +      */
> > +     if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> > +             return 0;
> > +
> > +     state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> > +     if (!state)
> > +             return -EINVAL;
> > +
> > +     /*
> > +      * Compat-mode pthreads share a limited address space.
> > +      * If each function call takes an average of four slots
> > +      * stack space, allocate 1/4 of stack size for shadow stack.
> > +      */
> > +     if (in_compat_syscall())
> > +             stack_size /= 4;
> > +
> > +     stack_size = round_up(stack_size, PAGE_SIZE);
> > +     addr = alloc_shstk(stack_size);
> > +     if (IS_ERR_VALUE(addr)) {
> > +             shstk->base = 0;
> > +             shstk->size = 0;
> > +             return PTR_ERR((void *)addr);
> > +     }
> > +
> > +     fpu__prepare_write(&tsk->thread.fpu);
> > +     state->user_ssp = (u64)(addr + stack_size);
> > +     shstk->base = addr;
> > +     shstk->size = stack_size;
> > +     return 0;
> > +}
> > +
> >  void shstk_free(struct task_struct *tsk)
> >  {
> >       struct thread_shstk *shstk = &tsk->thread.shstk;
> > @@ -80,7 +127,13 @@ void shstk_free(struct task_struct *tsk)
> >           !shstk->base)
> >               return;
> >
> > -     if (!tsk->mm)
> > +     /*
> > +      * When fork() with CLONE_VM fails, the child (tsk) already has a
> > +      * shadow stack allocated, and exit_thread() calls this function to
> > +      * free it.  In this case the parent (current) and the child share
> > +      * the same mm struct.
> > +      */
> > +     if (!tsk->mm || tsk->mm != current->mm)
> >               return;
> >
> >       while (1) {
John Allen July 28, 2021, 9:34 p.m. UTC | #8
On Wed, Jul 21, 2021 at 11:34:53AM -0700, Yu, Yu-cheng wrote:
> On 7/21/2021 11:28 AM, Florian Weimer wrote:
> > I expect that container runtimes turn clone3 into clone in the same way
> > (via ENOSYS), at least for the medium term.  So it would make sense to
> > allocate some sort of shadow stack for clone as well, if that's possible
> > to implement in some way.
> > 
> > Thanks,
> > Florian
> > 
> 
> Thanks Florian!  And because of that reason, we will put back clone2 support
> in my next v28 patches.
> 
> Yu-cheng

I tested with v28 of the patches on the same system and it appears to
fix the issue I was seeing.

Thanks,
John
diff mbox series

Patch

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 6432baf4de1f..4314a41ab3c9 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -17,10 +17,15 @@  struct thread_shstk {
 
 #ifdef CONFIG_X86_SHADOW_STACK
 int shstk_setup(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size);
 void shstk_free(struct task_struct *p);
 void shstk_disable(void);
 #else
 static inline int shstk_setup(void) { return 0; }
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 static inline void shstk_disable(void) {}
 #endif
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..e1dd083261a5 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -12,6 +12,7 @@ 
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
 #include <asm/debugreg.h>
+#include <asm/cet.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -146,6 +147,8 @@  do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5e1f38179f49..7a583a28ddb2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,6 +43,7 @@ 
 #include <asm/io_bitmap.h>
 #include <asm/proto.h>
 #include <asm/frame.h>
+#include <asm/cet.h>
 
 #include "process.h"
 
@@ -104,6 +105,7 @@  void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -117,8 +119,9 @@  static int set_new_tls(struct task_struct *p, unsigned long tls)
 		return do_set_thread_area_64(p, ARCH_SET_FS, tls);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+int copy_thread(unsigned long clone_flags, unsigned long sp,
+		unsigned long stack_size, struct task_struct *p,
+		unsigned long tls)
 {
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
@@ -158,7 +161,7 @@  int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		memset(childregs, 0, sizeof(struct pt_regs));
-		kthread_frame_init(frame, sp, arg);
+		kthread_frame_init(frame, sp, stack_size);
 		return 0;
 	}
 
@@ -185,7 +188,7 @@  int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 		 */
 		childregs->sp = 0;
 		childregs->ip = 0;
-		kthread_frame_init(frame, sp, arg);
+		kthread_frame_init(frame, sp, stack_size);
 		return 0;
 	}
 
@@ -193,6 +196,10 @@  int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	if (clone_flags & CLONE_SETTLS)
 		ret = set_new_tls(p, tls);
 
+	/* Allocate a new shadow stack for pthread */
+	if (!ret)
+		ret = shstk_alloc_thread_stack(p, clone_flags, stack_size);
+
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 5ea2b494e9f9..8e5f772181b9 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -71,6 +71,53 @@  int shstk_setup(void)
 	return 0;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	struct cet_user_state *state;
+	unsigned long addr;
+
+	if (!stack_size)
+		return -EINVAL;
+
+	if (!shstk->size)
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+	state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
+	if (!state)
+		return -EINVAL;
+
+	/*
+	 * Compat-mode pthreads share a limited address space.
+	 * If each function call takes an average of four slots
+	 * stack space, allocate 1/4 of stack size for shadow stack.
+	 */
+	if (in_compat_syscall())
+		stack_size /= 4;
+
+	stack_size = round_up(stack_size, PAGE_SIZE);
+	addr = alloc_shstk(stack_size);
+	if (IS_ERR_VALUE(addr)) {
+		shstk->base = 0;
+		shstk->size = 0;
+		return PTR_ERR((void *)addr);
+	}
+
+	fpu__prepare_write(&tsk->thread.fpu);
+	state->user_ssp = (u64)(addr + stack_size);
+	shstk->base = addr;
+	shstk->size = stack_size;
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -80,7 +127,13 @@  void shstk_free(struct task_struct *tsk)
 	    !shstk->base)
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it.  In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	while (1) {