[v27,24/31] x86/cet/shstk: Handle thread shadow stack

Message ID	20210521221211.29077-25-yu-cheng.yu@intel.com (mailing list archive)
State	New, archived
Series	Control-flow Enforcement: Shadow Stack

Yu-cheng Yu May 21, 2021, 10:12 p.m. UTC

For clone() with CLONE_VM, except vfork, the child and the parent must have
separate shadow stacks.  Thus, the kernel allocates a new shadow stack for
the child and frees it on thread exit.

Use the stack_size passed from the clone3() syscall as the thread shadow
stack size.  A compat-mode thread shadow stack is further reduced to 1/4 of
that size.  This allows more threads to run in a 32-bit address space.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/cet.h         |  5 +++
 arch/x86/include/asm/mmu_context.h |  3 ++
 arch/x86/kernel/process.c          | 15 +++++---
 arch/x86/kernel/shstk.c            | 55 +++++++++++++++++++++++++++++-
 4 files changed, 73 insertions(+), 5 deletions(-)
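
The sizing rule described in the commit message can be sketched in plain C
as follows.  This is a userspace model, not the kernel code:
thread_shstk_size(), its is_compat flag, and the page_size parameter are
names assumed for illustration; the patch itself uses in_compat_syscall(),
round_up(), and PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the thread shadow stack sizing rule.
 * Hypothetical helper, not part of the patch: the kernel code uses
 * in_compat_syscall(), round_up() and PAGE_SIZE instead.
 */
static unsigned long thread_shstk_size(unsigned long stack_size,
                                       bool is_compat,
                                       unsigned long page_size)
{
        /* page_size must be a non-zero power of two for the mask below. */
        assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

        /* Compat-mode threads get a quarter of the requested size. */
        if (is_compat)
                stack_size /= 4;

        /* Round up to a whole number of pages. */
        return (stack_size + page_size - 1) & ~(page_size - 1);
}
```

With a 4096-byte page, an 8 MiB request stays at 8 MiB for a 64-bit thread
and becomes 2 MiB for a compat thread; a 10000-byte compat request becomes
2500 bytes, which rounds up to a single page.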

Andy Lutomirski May 22, 2021, 11:39 p.m. UTC | #1

On Fri, May 21, 2021 at 3:14 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 5ea2b494e9f9..8e5f772181b9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -71,6 +71,53 @@ int shstk_setup(void)
>         return 0;
>  }
>
> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> +                            unsigned long stack_size)
> +{

...

> +       state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> +       if (!state)
> +               return -EINVAL;
> +

The get_xsave_addr() API is horrible, and we already have one
egregiously buggy instance in the kernel.  Let's not add another
dubious use case.

If state == NULL, this means that CET_USER is in its init state.
CET_USER in the init state should behave identically regardless of
whether XINUSE[CET_USER] is set.  Can you please adjust this code
accordingly?

Thanks,
Andy
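
One way to read this request: a NULL return from get_xsave_addr() is not an
error, it only says the component is still in its init state, so the caller
should carry on as if it had read that init state.  The sketch below models
only the read side of that idea in plain C.  struct cet_state_model and
cet_state_or_init() are hypothetical names, and treating the CET_USER init
state as all zeroes is an assumption of this sketch, not something stated
in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's CET user state. */
struct cet_state_model {
        uint64_t user_cet;
        uint64_t user_ssp;
};

/*
 * Return the effective state.  A NULL pointer models "component is in
 * its init state" and is treated as all zeroes (assumed), not as an
 * error.
 */
static struct cet_state_model
cet_state_or_init(const struct cet_state_model *saved)
{
        static const struct cet_state_model init_state = { 0, 0 };

        return saved ? *saved : init_state;
}
```

The point of the model is that both cases return a usable state, so the
caller has no -EINVAL path for the NULL case.  How the kernel should write
user_ssp when the component is absent from the buffer is not covered here.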

Yu-cheng Yu May 25, 2021, 3:04 p.m. UTC | #2

On 5/22/2021 4:39 PM, Andy Lutomirski wrote:
> On Fri, May 21, 2021 at 3:14 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
>> index 5ea2b494e9f9..8e5f772181b9 100644
>> --- a/arch/x86/kernel/shstk.c
>> +++ b/arch/x86/kernel/shstk.c
>> @@ -71,6 +71,53 @@ int shstk_setup(void)
>>          return 0;
>>   }
>>
>> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
>> +                            unsigned long stack_size)
>> +{
> 
> ...
> 
>> +       state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
>> +       if (!state)
>> +               return -EINVAL;
>> +
> 
> The get_xsave_addr() API is horrible, and we already have one
> egregiously buggy instance in the kernel.  Let's not add another
> dubious use case.
> 
> If state == NULL, this means that CET_USER is in its init state.
> CET_USER in the init state should behave identically regardless of
> whether XINUSE[CET_USER] is set.  Can you please adjust this code
> accordingly?
> 

I will work on that.

Thanks,
Yu-cheng

John Allen July 21, 2021, 6:14 p.m. UTC | #3

On Fri, May 21, 2021 at 03:12:04PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 5ea2b494e9f9..8e5f772181b9 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -71,6 +71,53 @@ int shstk_setup(void)
>  	return 0;
>  }
>  
> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> +			     unsigned long stack_size)
> +{
> +	struct thread_shstk *shstk = &tsk->thread.shstk;
> +	struct cet_user_state *state;
> +	unsigned long addr;
> +
> +	if (!stack_size)
> +		return -EINVAL;

I've been doing some light testing on AMD hardware and I've found that
this version of the patchset doesn't boot for me. It appears that when
systemd processes start spawning, they hit the above case, return
-EINVAL, and the fork fails. In these cases, copy_thread has been passed
0 for both sp and stack_size.

For previous versions of the patchset, I can still boot. When the
stack_size check was last, the function would always return before
completing the check, hitting one of the two cases below.

At the very least, it would seem that on some systems, it isn't valid to
rely on the stack_size passed from clone3, though I'm unsure what the
correct behavior should be here. If the passed stack_size == 0 and sp ==
0, is this a case where we want to alloc a shadow stack for this thread
with some capped size? Alternatively, is this a case that isn't valid to
alloc a shadow stack and we should simply return 0 instead of -EINVAL?

I'm running Fedora 34 which satisfies the required versions of gcc,
binutils, and glibc.

Please let me know if there is any additional information I can provide.

Thanks,
John

> +
> +	if (!shstk->size)
> +		return 0;
> +
> +	/*
> +	 * For CLONE_VM, except vfork, the child needs a separate shadow
> +	 * stack.
> +	 */
> +	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> +		return 0;
> +
> +	state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> +	if (!state)
> +		return -EINVAL;
> +
> +	/*
> +	 * Compat-mode pthreads share a limited address space.
> +	 * If each function call takes an average of four slots
> +	 * stack space, allocate 1/4 of stack size for shadow stack.
> +	 */
> +	if (in_compat_syscall())
> +		stack_size /= 4;
> +
> +	stack_size = round_up(stack_size, PAGE_SIZE);
> +	addr = alloc_shstk(stack_size);
> +	if (IS_ERR_VALUE(addr)) {
> +		shstk->base = 0;
> +		shstk->size = 0;
> +		return PTR_ERR((void *)addr);
> +	}
> +
> +	fpu__prepare_write(&tsk->thread.fpu);
> +	state->user_ssp = (u64)(addr + stack_size);
> +	shstk->base = addr;
> +	shstk->size = stack_size;
> +	return 0;
> +}
> +
>  void shstk_free(struct task_struct *tsk)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> @@ -80,7 +127,13 @@ void shstk_free(struct task_struct *tsk)
>  	    !shstk->base)
>  		return;
>  
> -	if (!tsk->mm)
> +	/*
> +	 * When fork() with CLONE_VM fails, the child (tsk) already has a
> +	 * shadow stack allocated, and exit_thread() calls this function to
> +	 * free it.  In this case the parent (current) and the child share
> +	 * the same mm struct.
> +	 */
> +	if (!tsk->mm || tsk->mm != current->mm)
>  		return;
>  
>  	while (1) {
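
The boot failure described in this message comes down to the order of the
early returns.  The sketch below is a userspace model of that ordering, not
the kernel function: shstk_decision_v27() and shstk_decision_size_last()
are hypothetical names, shstk_enabled stands in for shstk->size != 0, and
the MODEL_CLONE_* constants mirror the values of CLONE_VM and CLONE_VFORK.
A return of 1 means "go on and allocate"; the "size last" variant follows
the description of earlier versions given in the report.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_CLONE_VM    0x00000100UL  /* same value as CLONE_VM */
#define MODEL_CLONE_VFORK 0x00004000UL  /* same value as CLONE_VFORK */

/* True only for CLONE_VM without CLONE_VFORK, i.e. a new thread. */
static bool needs_own_shstk(unsigned long clone_flags)
{
        return (clone_flags & (MODEL_CLONE_VFORK | MODEL_CLONE_VM)) ==
               MODEL_CLONE_VM;
}

/* v27 ordering: the stack_size check comes first. */
static int shstk_decision_v27(unsigned long clone_flags,
                              unsigned long stack_size, bool shstk_enabled)
{
        if (!stack_size)
                return -EINVAL;
        if (!shstk_enabled)
                return 0;
        if (!needs_own_shstk(clone_flags))
                return 0;
        return 1;
}

/* Ordering with the stack_size check last, as the report describes it. */
static int shstk_decision_size_last(unsigned long clone_flags,
                                    unsigned long stack_size,
                                    bool shstk_enabled)
{
        if (!shstk_enabled)
                return 0;
        if (!needs_own_shstk(clone_flags))
                return 0;
        if (!stack_size)
                return -EINVAL;
        return 1;
}
```

A plain fork (clone_flags == 0, stack_size == 0) from a task without a
shadow stack gets -EINVAL under the first ordering and 0 under the second,
which matches the observed difference between v27 and earlier versions.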

Florian Weimer July 21, 2021, 6:28 p.m. UTC | #4

* John Allen:

> At the very least, it would seem that on some systems, it isn't valid to
> rely on the stack_size passed from clone3, though I'm unsure what the
> correct behavior should be here. If the passed stack_size == 0 and sp ==
> 0, is this a case where we want to alloc a shadow stack for this thread
> with some capped size? Alternatively, is this a case that isn't valid to
> alloc a shadow stack and we should simply return 0 instead of -EINVAL?
>
> I'm running Fedora 34 which satisfies the required versions of gcc,
> binutils, and glibc.

Fedora 34 doesn't use clone3 yet.  You can upgrade to a rawhide build,
e.g. glibc-2.33.9000-46.fc35:

  <https://koji.fedoraproject.org/koji/buildinfo?buildID=1782678>

It's currently not in main rawhide because the Firefox sandbox breaks
clone3.  The “fix” is that clone3 will fail with ENOSYS under the
sandbox.

I expect that container runtimes turn clone3 into clone in the same way
(via ENOSYS), at least for the medium term.  So it would make sense to
allocate some sort of shadow stack for clone as well, if that's possible
to implement in some way.

Thanks,
Florian

Yu-cheng Yu July 21, 2021, 6:34 p.m. UTC | #5

On 7/21/2021 11:28 AM, Florian Weimer wrote:
> * John Allen:
> 
>> At the very least, it would seem that on some systems, it isn't valid to
>> rely on the stack_size passed from clone3, though I'm unsure what the
>> correct behavior should be here. If the passed stack_size == 0 and sp ==
>> 0, is this a case where we want to alloc a shadow stack for this thread
>> with some capped size? Alternatively, is this a case that isn't valid to
>> alloc a shadow stack and we should simply return 0 instead of -EINVAL?
>>
>> I'm running Fedora 34 which satisfies the required versions of gcc,
>> binutils, and glibc.
> 
> Fedora 34 doesn't use clone3 yet.  You can upgrade to a rawhide build,
> e.g. glibc-2.33.9000-46.fc35:
> 
>    <https://koji.fedoraproject.org/koji/buildinfo?buildID=1782678>
> 
> It's currently not in main rawhide because the Firefox sandbox breaks
> clone3.  The “fix” is that clone3 will fail with ENOSYS under the
> sandbox.
> 
> I expect that container runtimes turn clone3 into clone in the same way
> (via ENOSYS), at least for the medium term.  So it would make sense to
> allocate some sort of shadow stack for clone as well, if that's possible
> to implement in some way.
> 
> Thanks,
> Florian
> 

Thanks Florian!  And because of that reason, we will put back clone2 
support in my next v28 patches.

Yu-cheng

Dave Hansen July 21, 2021, 6:37 p.m. UTC | #6

On 7/21/21 11:14 AM, John Allen wrote:
>> +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
>> +			     unsigned long stack_size)
>> +{
>> +	struct thread_shstk *shstk = &tsk->thread.shstk;
>> +	struct cet_user_state *state;
>> +	unsigned long addr;
>> +
>> +	if (!stack_size)
>> +		return -EINVAL;
> I've been doing some light testing on AMD hardware and I've found that
> this version of the patchset doesn't boot for me. It appears that when
> systemd processes start spawning, they hit the above case, return
> -EINVAL, and the fork fails. In these cases, copy_thread has been passed
> 0 for both sp and stack_size.

A few tangential things I noticed:

This hunk is not mentioned in the version changelog at all.  I also
don't see any feedback that might have prompted it.  This is one reason
per-patch changelogs are preferred.

As a general rule, new features should strive to be implemented in a way
that it's *obvious* that they won't break old code.
shstk_alloc_thread_stack() fails that test for me.  If it had:

	if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) // or whatever
		return 0;

in the function, it would be obviously harmless.  Better yet would be
doing the feature check at the shstk_alloc_thread_stack() call site,
that way even the function call can be optimized out.

Further, this confused me because the changelog didn't even mention the
arg -> stack_size rename.  That would have been nice for another patch,
or an extra sentence in the changelog.
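
The call-site variant of the feature check suggested above can be sketched
as a small inline wrapper.  Everything here is a userspace model:
model_feature_enabled, model_alloc_thread_stack() and
model_copy_thread_shstk() are hypothetical stand-ins for
cpu_feature_enabled(X86_FEATURE_SHSTK), shstk_alloc_thread_stack() and its
caller.

```c
#include <stdbool.h>

static bool model_feature_enabled;      /* stand-in for the CPU feature test */
static unsigned int model_alloc_calls;  /* counts calls into the allocator */

/* Stand-in for the allocation routine; it only records that it ran. */
static int model_alloc_thread_stack(unsigned long clone_flags,
                                    unsigned long stack_size)
{
        (void)clone_flags;
        (void)stack_size;
        model_alloc_calls++;
        return 0;
}

/*
 * Call-site gate: with the feature off, the allocator is never entered,
 * so existing code paths cannot be affected by it.
 */
static inline int model_copy_thread_shstk(unsigned long clone_flags,
                                          unsigned long stack_size)
{
        if (!model_feature_enabled)
                return 0;
        return model_alloc_thread_stack(clone_flags, stack_size);
}
```

With the flag clear, model_copy_thread_shstk(0, 0) returns 0 and the call
counter stays at zero, which is the "obviously harmless" property the
review asks for.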

H.J. Lu July 21, 2021, 8:14 p.m. UTC | #7

On Wed, Jul 21, 2021 at 11:15 AM John Allen <john.allen@amd.com> wrote:
>
> On Fri, May 21, 2021 at 03:12:04PM -0700, Yu-cheng Yu wrote:
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 5ea2b494e9f9..8e5f772181b9 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -71,6 +71,53 @@ int shstk_setup(void)
> >       return 0;
> >  }
> >
> > +int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> > +                          unsigned long stack_size)
> > +{
> > +     struct thread_shstk *shstk = &tsk->thread.shstk;
> > +     struct cet_user_state *state;
> > +     unsigned long addr;
> > +
> > +     if (!stack_size)
> > +             return -EINVAL;
>
> I've been doing some light testing on AMD hardware and I've found that
> this version of the patchset doesn't boot for me. It appears that when
> systemd processes start spawning, they hit the above case, return
> -EINVAL, and the fork fails. In these cases, copy_thread has been passed
> 0 for both sp and stack_size.
>
> For previous versions of the patchset, I can still boot. When the
> stack_size check was last, the function would always return before
> completing the check, hitting one of the two cases below.
>
> At the very least, it would seem that on some systems, it isn't valid to
> rely on the stack_size passed from clone3, though I'm unsure what the
> correct behavior should be here. If the passed stack_size == 0 and sp ==
> 0, is this a case where we want to alloc a shadow stack for this thread
> with some capped size? Alternatively, is this a case that isn't valid to
> alloc a shadow stack and we should simply return 0 instead of -EINVAL?
>
> I'm running Fedora 34 which satisfies the required versions of gcc,
> binutils, and glibc.
>
> Please let me know if there is any additional information I can provide.

FWIW, I have been maintaining stable CET kernels at:

https://github.com/hjl-tools/linux/

The current CET kernel is on hjl/cet/linux-5.13.y branch.

> Thanks,
> John
>
> > +
> > +     if (!shstk->size)
> > +             return 0;
> > +
> > +     /*
> > +      * For CLONE_VM, except vfork, the child needs a separate shadow
> > +      * stack.
> > +      */
> > +     if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
> > +             return 0;
> > +
> > +     state = get_xsave_addr(&tsk->thread.fpu.state.xsave, XFEATURE_CET_USER);
> > +     if (!state)
> > +             return -EINVAL;
> > +
> > +     /*
> > +      * Compat-mode pthreads share a limited address space.
> > +      * If each function call takes an average of four slots
> > +      * stack space, allocate 1/4 of stack size for shadow stack.
> > +      */
> > +     if (in_compat_syscall())
> > +             stack_size /= 4;
> > +
> > +     stack_size = round_up(stack_size, PAGE_SIZE);
> > +     addr = alloc_shstk(stack_size);
> > +     if (IS_ERR_VALUE(addr)) {
> > +             shstk->base = 0;
> > +             shstk->size = 0;
> > +             return PTR_ERR((void *)addr);
> > +     }
> > +
> > +     fpu__prepare_write(&tsk->thread.fpu);
> > +     state->user_ssp = (u64)(addr + stack_size);
> > +     shstk->base = addr;
> > +     shstk->size = stack_size;
> > +     return 0;
> > +}
> > +
> >  void shstk_free(struct task_struct *tsk)
> >  {
> >       struct thread_shstk *shstk = &tsk->thread.shstk;
> > @@ -80,7 +127,13 @@ void shstk_free(struct task_struct *tsk)
> >           !shstk->base)
> >               return;
> >
> > -     if (!tsk->mm)
> > +     /*
> > +      * When fork() with CLONE_VM fails, the child (tsk) already has a
> > +      * shadow stack allocated, and exit_thread() calls this function to
> > +      * free it.  In this case the parent (current) and the child share
> > +      * the same mm struct.
> > +      */
> > +     if (!tsk->mm || tsk->mm != current->mm)
> >               return;
> >
> >       while (1) {

John Allen July 28, 2021, 9:34 p.m. UTC | #8

On Wed, Jul 21, 2021 at 11:34:53AM -0700, Yu, Yu-cheng wrote:
> On 7/21/2021 11:28 AM, Florian Weimer wrote:
> > I expect that container runtimes turn clone3 into clone in the same way
> > (via ENOSYS), at least for the medium term.  So it would make sense to
> > allocate some sort of shadow stack for clone as well, if that's possible
> > to implement in some way.
> > 
> > Thanks,
> > Florian
> > 
> 
> Thanks Florian!  And because of that reason, we will put back clone2 support
> in my next v28 patches.
> 
> Yu-cheng

I tested with v28 of the patches on the same system and it appears to
fix the issue I was seeing.

Thanks,
John
