From patchwork Thu Jan 19 21:23:06 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Rick Edgecombe
X-Patchwork-Id: 13108806
From: Rick Edgecombe
To: x86@kernel.org, "H . Peter Anvin", Thomas Gleixner, Ingo Molnar,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, linux-arch@vger.kernel.org,
 linux-api@vger.kernel.org, Arnd Bergmann, Andy Lutomirski,
 Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
 Eugene Syromiatnikov, Florian Weimer, "H . J . Lu", Jann Horn,
 Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov,
 Pavel Machek, Peter Zijlstra, Randy Dunlap, Weijiang Yang,
 "Kirill A . Shutemov", John Allen, kcc@google.com, eranian@google.com,
 rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com,
 akpm@linux-foundation.org, Andrew.Cooper3@citrix.com,
 christina.schimpe@intel.com
Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu
Subject: [PATCH v5 28/39] x86/shstk: Handle thread shadow stack
Date: Thu, 19 Jan 2023 13:23:06 -0800
Message-Id: <20230119212317.8324-29-rick.p.edgecombe@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230119212317.8324-1-rick.p.edgecombe@intel.com>
References: <20230119212317.8324-1-rick.p.edgecombe@intel.com>

From: Yu-cheng Yu

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits.
So as long as the child doesn't return from the vfork()/CLONE_VFORK
calling function and sticks to a limited set of operations, the parent
and child can share the same stack.

For shadow stack, these scenarios present similar sharing problems. For
the CLONE_VM case, the child and the parent must have separate shadow
stacks. Instead of changing clone to take a shadow stack, have the kernel
just allocate one and switch to it.

Use stack_size passed from the clone3() syscall for the thread shadow
stack size. A compat-mode thread shadow stack size is further reduced to
1/4. This allows more threads to run in a 32-bit address space. clone()
does not pass stack_size, which was added to clone3(). In that case, use
the RLIMIT_STACK size and cap it to 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET
does not add any additional limitations for vfork().

Userspace implementing POSIX vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from the vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared with the
parent.

When a new shadow stack is allocated during clone, the shadow stack
pointer of the new thread needs to be updated to point to it. Since the
ability to do this is confined to the FPU subsystem, change fpu_clone()
to take the new shadow stack pointer, and update it internally inside
the FPU subsystem.
This part was suggested by Thomas Gleixner.

Reviewed-by: Kees Cook
Tested-by: Pengfei Xu
Tested-by: John Allen
Suggested-by: Thomas Gleixner
Signed-off-by: Yu-cheng Yu
Co-developed-by: Rick Edgecombe
Signed-off-by: Rick Edgecombe
---

v3:
 - Fix update_fpu_shstk() stub (Mike Rapoport)
 - Fix chunks around alloc_shstk() in wrong patch (Kees)
 - Fix stack_size/flags swap (Kees)
 - Use centralized stack size logic (Kees)

v2:
 - Have fpu_clone() take new shadow stack pointer and update SSP in
   xsave buffer for new task. (tglx)

v1:
 - Expand commit log.
 - Add more comments.
 - Switch to xsave helpers.

Yu-cheng v30:
 - Update comments about clone()/clone3(). (Borislav Petkov)

 arch/x86/include/asm/fpu/sched.h   |  3 +-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/include/asm/shstk.h       |  7 +++++
 arch/x86/kernel/fpu/core.c         | 41 +++++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 18 +++++++++++-
 arch/x86/kernel/shstk.c            | 44 ++++++++++++++++++++++++++++--
 6 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index b2486b2cbc6e..54c9c2fd1907 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 extern void fpu__drop(struct fpu *fpu);
-extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+		     unsigned long shstk_addr);
 extern void fpu_flush_thread(void);
 
 /*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e01aa74a6de7..9714f08d941b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -147,6 +147,8 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	if (!tsk->vfork_done)			\
+		shstk_free(tsk);		\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index f40414a982e8..172a69052770 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -15,11 +15,18 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long features);
 void reset_thread_features(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+			     unsigned long stack_size,
+			     unsigned long *shstk_addr);
 void shstk_free(struct task_struct *p);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long features) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+					   unsigned long clone_flags,
+					   unsigned long stack_size,
+					   unsigned long *shstk_addr) { return 0; }
 static inline void shstk_free(struct task_struct *p) {}
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7317bfd5ea36..c72262479f03 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 	}
 }
 
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+	struct cet_user_state *xstate;
+
+	/* If ssp update is not needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+
+	return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
+{
+	return 0;
+}
+#endif
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
 
+	/*
+	 * Update shadow stack pointer, in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;
+
 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e57cd31bfec4..13a0a81d70b9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include
 
 #include "process.h"
 
@@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
@@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long shstk_addr = 0;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	fpu_clone(p, clone_flags, args->fn);
+	/* Allocate a new shadow stack for pthread if needed */
+	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
+				       &shstk_addr);
+	if (ret)
+		return ret;
+
+	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
@@ -220,6 +229,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
 
+	/*
+	 * If copy_thread() is failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);
+
 	return ret;
 }
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index f39e5d3b9303..111ea56115d2 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size)
 	unsigned long addr, unused;
 
 	mmap_write_lock(mm);
-	addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
 		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 
 	mmap_write_unlock(mm);
@@ -126,6 +126,40 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+			     unsigned long stack_size, unsigned long *shstk_addr)
+{
+	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long addr, size;
+
+	/*
+	 * If shadow stack is not enabled on the new thread, skip any
+	 * switch to a new shadow stack.
+	 */
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	/*
+	 * For CLONE_VM, except vfork, the child needs a separate shadow
+	 * stack.
+	 */
+	if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+		return 0;
+
+
+	size = adjust_shstk_size(stack_size);
+	addr = alloc_shstk(size);
+	if (IS_ERR_VALUE(addr))
+		return PTR_ERR((void *)addr);
+
+	shstk->base = addr;
+	shstk->size = size;
+
+	*shstk_addr = addr + size;
+
+	return 0;
+}
+
 void shstk_free(struct task_struct *tsk)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +168,13 @@ void shstk_free(struct task_struct *tsk)
 	    !features_enabled(ARCH_SHSTK_SHSTK))
 		return;
 
-	if (!tsk->mm)
+	/*
+	 * When fork() with CLONE_VM fails, the child (tsk) already has a
+	 * shadow stack allocated, and exit_thread() calls this function to
+	 * free it. In this case the parent (current) and the child share
+	 * the same mm struct.
+	 */
+	if (!tsk->mm || tsk->mm != current->mm)
 		return;
 
 	unmap_shadow_stack(shstk->base, shstk->size);
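
For readers who want to try the allocation policy outside the kernel, the
two decisions made on the clone path can be sketched in user-space C. This
is only an illustration, not the kernel code: the sketch_* names are
hypothetical, the flag values are copied from the Linux UAPI header
<linux/sched.h>, and the size helper follows the commit message (explicit
stack_size if given, otherwise RLIMIT_STACK capped to 4 GB) rather than
the body of adjust_shstk_size(), which is not part of this patch. The
compat-mode 1/4 reduction is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the Linux UAPI header <linux/sched.h>. */
#define SKETCH_CLONE_VM    0x00000100UL
#define SKETCH_CLONE_VFORK 0x00004000UL

/* The 4 GB cap mentioned in the commit message. */
#define SKETCH_SZ_4G (1ULL << 32)

/*
 * Hypothetical helper mirroring the flag test in
 * shstk_alloc_thread_stack(): a separate shadow stack is allocated only
 * when the child shares the address space (CLONE_VM) and is not a vfork
 * child (no CLONE_VFORK).
 */
static bool sketch_needs_new_shstk(unsigned long clone_flags)
{
	return (clone_flags & (SKETCH_CLONE_VFORK | SKETCH_CLONE_VM)) ==
	       SKETCH_CLONE_VM;
}

/*
 * Hypothetical helper for the size choice described in the commit
 * message: use the stack_size passed via clone3() when it is non-zero,
 * otherwise fall back to RLIMIT_STACK capped to 4 GB. Any rounding the
 * kernel applies is an unknown here and is deliberately left out.
 */
static uint64_t sketch_shstk_size(uint64_t stack_size, uint64_t rlimit_stack)
{
	if (stack_size)
		return stack_size;

	return rlimit_stack < SKETCH_SZ_4G ? rlimit_stack : SKETCH_SZ_4G;
}
```

With these helpers, a pthread-style clone (CLONE_VM set, CLONE_VFORK
clear) asks for a new shadow stack, while fork() and vfork() do not, and
an unlimited RLIMIT_STACK yields the 4 GB cap.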