From patchwork Mon Apr 3 14:05:59 2023
X-Patchwork-Submitter: Lai Jiangshan
X-Patchwork-Id: 13198224
From: Lai Jiangshan
To: linux-kernel@vger.kernel.org
Cc: Lai Jiangshan , "H.
Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" Subject: [RFC PATCH 1/7] x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry Date: Mon, 3 Apr 2023 22:05:59 +0800 Message-Id: <20230403140605.540512-2-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com> References: <20230403140605.540512-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Lai Jiangshan Prepare to call paranoid_entry() with pt_regs pushed. Signed-off-by: Lai Jiangshan --- arch/x86/entry/entry_64.S | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index eccc3431e515..49ddc4dd3117 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -475,11 +475,13 @@ SYM_CODE_START(\asmsym) testb $3, CS-ORIG_RAX(%rsp) jnz .Lfrom_usermode_switch_stack_\@ + PUSH_AND_CLEAR_REGS + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + /* paranoid_entry returns GS information for paranoid_exit in EBX. */ call paranoid_entry - UNWIND_HINT_REGS - movq %rsp, %rdi /* pt_regs pointer */ call \cfunc @@ -530,14 +532,16 @@ SYM_CODE_START(\asmsym) testb $3, CS-ORIG_RAX(%rsp) jnz .Lfrom_usermode_switch_stack_\@ + PUSH_AND_CLEAR_REGS + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS */ call paranoid_entry - UNWIND_HINT_REGS - /* * Switch off the IST stack to make it free for nested exceptions. The * vc_switch_off_ist() function will switch back to the interrupted @@ -587,9 +591,12 @@ SYM_CODE_START(\asmsym) ASM_CLAC cld + PUSH_AND_CLEAR_REGS + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + /* paranoid_entry returns GS information for paranoid_exit in EBX. */ call paranoid_entry - UNWIND_HINT_REGS movq %rsp, %rdi /* pt_regs pointer into first argument */ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/ @@ -903,8 +910,8 @@ SYM_CODE_END(xen_failsafe_callback) #endif /* CONFIG_XEN_PV */ /* - * Save all registers in pt_regs. Return GSBASE related information - * in EBX depending on the availability of the FSGSBASE instructions: + * Return GSBASE related information in EBX depending on the availability + * of the FSGSBASE instructions: * * FSGSBASE R/EBX * N 0 -> SWAPGS on exit @@ -915,11 +922,8 @@ SYM_CODE_END(xen_failsafe_callback) * R14 - old CR3 * R15 - old SPEC_CTRL */ -SYM_CODE_START(paranoid_entry) +SYM_FUNC_START(paranoid_entry) ANNOTATE_NOENDBR - UNWIND_HINT_FUNC - PUSH_AND_CLEAR_REGS save_ret=1 - ENCODE_FRAME_POINTER 8 /* * Always stash CR3 in %r14. This value will be restored, @@ -988,7 +992,7 @@ SYM_CODE_START(paranoid_entry) UNTRAIN_RET_FROM_CALL RET -SYM_CODE_END(paranoid_entry) +SYM_FUNC_END(paranoid_entry) /* * "Paranoid" exit path from exception stack. 
This is invoked @@ -1443,6 +1447,10 @@ end_repeat_nmi: */ pushq $-1 /* ORIG_RAX: no syscall to restart */ + PUSH_AND_CLEAR_REGS + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + /* * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit * as we should not be calling schedule in NMI context. * Even with normal interrupts enabled. An NMI should not be * setting NEED_RESCHED or anything that normal interrupts and * exceptions might do. */ call paranoid_entry - UNWIND_HINT_REGS movq %rsp, %rdi movq $-1, %rsi
From patchwork Mon Apr 3 14:06:00 2023
X-Patchwork-Submitter: Lai Jiangshan
X-Patchwork-Id: 13198225
From: Lai Jiangshan
To: linux-kernel@vger.kernel.org
Cc: Lai Jiangshan , "H. Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Jonathan Corbet , linux-doc@vger.kernel.org
Subject: [RFC PATCH 2/7] x86/entry: Add IST main stack
Date: Mon, 3 Apr 2023 22:06:00 +0800
Message-Id: <20230403140605.540512-3-jiangshanlai@gmail.com>
In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com>
References: <20230403140605.540512-1-jiangshanlai@gmail.com>

From: Lai Jiangshan

Add an IST main stack for the atomic-IST-entry. The size is THREAD_SIZE since multiple super exceptions might be handled on the stack at the same time.

Signed-off-by: Lai Jiangshan --- Documentation/x86/kernel-stacks.rst | 2 ++ arch/x86/include/asm/cpu_entry_area.h | 5 +++++ arch/x86/kernel/dumpstack_64.c | 6 ++++-- arch/x86/mm/cpu_entry_area.c | 1 + 4 files changed, 12 insertions(+), 2 deletions(-) diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst index 6b0bcf027ff1..be89acf302da 100644 --- a/Documentation/x86/kernel-stacks.rst +++ b/Documentation/x86/kernel-stacks.rst @@ -105,6 +105,8 @@ The currently assigned IST stacks are: middle of switching stacks. Using IST for MCE events avoids making assumptions about the previous state of the kernel stack. +* ESTACK_IST. Used by the atomic-IST-entry code as the common "IST main stack": IST events (except #DF) have their pt_regs copied onto it and are then handled on it. It is THREAD_SIZE bytes since frames of several nested IST events may be live on it at once. + For more details see the Intel IA32 or AMD AMD64 architecture manuals. diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h index 462fc34f1317..a373e8c37e25 100644 --- a/arch/x86/include/asm/cpu_entry_area.h +++ b/arch/x86/include/asm/cpu_entry_area.h @@ -10,6 +10,8 @@ #ifdef CONFIG_X86_64 +#define IST_MAIN_STKSZ THREAD_SIZE + #ifdef CONFIG_AMD_MEM_ENCRYPT #define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ #else @@ -30,6 +32,8 @@ char VC_stack[optional_stack_size]; \ char VC2_stack_guard[guardsize]; \ char VC2_stack[optional_stack_size]; \ + char IST_stack_guard[guardsize]; \ + char IST_stack[IST_MAIN_STKSZ]; \ char IST_top_guard[guardsize]; \ /* The exception stacks' physical storage.
No guard pages required */ @@ -52,6 +56,7 @@ enum exception_stack_ordering { ESTACK_MCE, ESTACK_VC, ESTACK_VC2, + ESTACK_IST, N_EXCEPTION_STACKS }; diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c index f05339fee778..3413b23fa9f1 100644 --- a/arch/x86/kernel/dumpstack_64.c +++ b/arch/x86/kernel/dumpstack_64.c @@ -26,11 +26,12 @@ static const char * const exception_stack_names[] = { [ ESTACK_MCE ] = "#MC", [ ESTACK_VC ] = "#VC", [ ESTACK_VC2 ] = "#VC2", + [ ESTACK_IST ] = "#IST", }; const char *stack_type_name(enum stack_type type) { - BUILD_BUG_ON(N_EXCEPTION_STACKS != 6); + BUILD_BUG_ON(N_EXCEPTION_STACKS != ARRAY_SIZE(exception_stack_names)); if (type == STACK_TYPE_TASK) return "TASK"; @@ -89,6 +90,7 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = { EPAGERANGE(MCE), EPAGERANGE(VC), EPAGERANGE(VC2), + EPAGERANGE(IST), }; static __always_inline bool in_exception_stack(unsigned long *stack, struct stack_info *info) @@ -98,7 +100,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac struct pt_regs *regs; unsigned int k; - BUILD_BUG_ON(N_EXCEPTION_STACKS != 6); + BUILD_BUG_ON(N_EXCEPTION_STACKS != 7); begin = (unsigned long)__this_cpu_read(cea_exception_stacks); /* diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index 7316a8224259..62341cb819ab 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -148,6 +148,7 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu) cea_map_stack(NMI); cea_map_stack(DB); cea_map_stack(MCE); + cea_map_stack(IST); if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) { if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
From patchwork Mon Apr 3 14:06:01 2023
X-Patchwork-Submitter: Lai Jiangshan
X-Patchwork-Id: 13198226
From: Lai Jiangshan
To: linux-kernel@vger.kernel.org
Cc: Lai Jiangshan , "H. Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Josh Poimboeuf , Arnd Bergmann
Subject: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry
Date: Mon, 3 Apr 2023 22:06:01 +0800
Message-Id: <20230403140605.540512-4-jiangshanlai@gmail.com>
In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com>
References: <20230403140605.540512-1-jiangshanlai@gmail.com>

From: Lai Jiangshan

See the comments in the cover-letter. They will be moved into the code and changelog here when improved.
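The core idea can be modeled with a small user-space sketch: the outmost frame is copied to the IST main stack as-is, while every nested frame is rewritten as if the entry it interrupted had already reached its commit point, i.e. its IP is forced to the commit label and its SP to the next-higher (outer) frame on the IST main stack. The sketch below is purely illustrative (made-up values, a toy "head" in place of pt_regs, no handling of where the general registers were actually saved and no #DF escalation); the real logic is ist_copy_regs_to_main_stack() in the patch below.

	#include <stdio.h>

	struct head { unsigned long ip, sp; };	/* toy stand-in for pt_regs */
	#define COMMIT_IP 0x1000UL		/* stand-in for commit_asm_exc_* */

	int main(void)
	{
		/*
		 * ctx[0] is the ongoing entry, ctx[2] the outmost one which
		 * interrupted ordinary kernel code (values are arbitrary).
		 */
		struct head ctx[3] = {
			{ .ip = 0x700, .sp = 0xf000 },
			{ .ip = 0x800, .sp = 0xe000 },
			{ .ip = 0x123, .sp = 0xd000 },
		};
		struct head main_stack[3];
		int nr = 3, i;

		/* The outmost frame really describes the interrupted context. */
		main_stack[nr - 1] = ctx[nr - 1];

		/* Nested frames: replicate "the interrupted entry has committed". */
		for (i = nr - 2; i >= 0; i--) {
			main_stack[i].ip = COMMIT_IP;
			main_stack[i].sp = (unsigned long)&main_stack[i + 1];
		}

		for (i = 0; i < nr; i++)
			printf("frame %d: ip=%#lx sp=%#lx\n",
			       i, main_stack[i].ip, main_stack[i].sp);
		return 0;
	}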
Signed-off-by: Lai Jiangshan --- arch/x86/entry/Makefile | 3 + arch/x86/entry/entry_64.S | 193 ++++++++++++++++++++ arch/x86/entry/ist_entry.c | 299 +++++++++++++++++++++++++++++++ arch/x86/kernel/asm-offsets_64.c | 7 + arch/x86/kernel/callthunks.c | 2 + tools/objtool/check.c | 7 +- 6 files changed, 510 insertions(+), 1 deletion(-) create mode 100644 arch/x86/entry/ist_entry.c diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index ca2fe186994b..7cc1254ca519 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -8,11 +8,14 @@ UBSAN_SANITIZE := n KCOV_INSTRUMENT := n CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) +CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS) CFLAGS_common.o += -fno-stack-protector +CFLAGS_ist_entry.o += -fno-stack-protector obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o obj-y += common.o +obj-$(CONFIG_X86_64) += ist_entry.o obj-y += vdso/ obj-y += vsyscall/ diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 49ddc4dd3117..50a24cc83581 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym) idtentry \vector asm_\cfunc \cfunc has_error_code=0 .endm +/** + * idtentry_ist - Macro to generate entry stubs for IST exceptions except #DF + * @vector: Vector number + * @asmsym: ASM symbol for the entry point + * @cfunc: C function to be called when it occurs in kernel + * @user_cfunc: C function to be called when it occurs in userspace + * @has_error_code: Hardware pushed error code on stack + * @stack_offset: Offset of the IST stack top in struct cea_exception_stacks + * + * The macro emits code to set up the kernel context for IST exceptions. + * + * From the hardware entry of the event to the SYM_INNER_LABEL(commit_\asmsym) + * is atomic-IST-entry (note: atomic-IST-entry is from the hardware entry, + * not merely from the first instruction of this macro). + * + * The atomic-IST-entry pushes pt_regs and copies the pt_regs to the IST + * main stack, and switches to it. If the atomic-IST-entry is interrupted + * by another IST event (except #DF), the new atomic-IST-entry will + * replicate the interrupted one as if every atomic-IST-entry is atomic. + * + * See the comments in entry64.c. + * + * When the cpu is on any IST stack or the IST main stack, %rsp can not be + * switched off except being interrupted by any IST exception or totally + * switching off (no usable data left). + * + * If the entry comes from user space, it turns to use the normal entry + * path finally on its kernel stack including the return to user space + * work and preemption checks on exit. The macro idtentry_body ensures + * the IST main stack is totally switched off (no usable data left) at + * the same time it switches to the kernel stack.. + * + * If hits in kernel mode then it needs to go through the paranoid + * entry as the exception can hit any random state. No preemption + * check on exit to keep the paranoid path simple. + */ +.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req +SYM_CODE_START(\asmsym) + UNWIND_HINT_IRET_REGS offset=\has_error_code*8 + ENDBR + + /* + * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX. + * + * The code setting ORIG_RAX will not be replicated if interrupted. + */ + ASM_CLAC + cld + + .if \has_error_code == 0 + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* + * No register can be touched except %rsp,%rflags,%rip before + * pushing all the registers. 
It is indispensable for nested + * atomic-IST-entry to replicate pushing the registers. + */ + PUSH_REGS + + /* + * Finished pushing register, all registers can be touched by now. + * + * Clear registers for the C function ist_copy_regs_to_main_stack() + * and the handler to avoid any possible exploitation of any + * speculation attack. + */ + CLEAR_REGS + + /* + * Copy the pt_regs to the IST main stack including the pt_regs of + * the interrupted atomic-IST-entris, if any, by replicating. + */ + movq %rsp, %rdi /* pt_regs pointer on its own IST stack */ + leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */ + call ist_copy_regs_to_main_stack + + /* + * Commit stage. + */ +SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL) + /* + * Switches to the IST main stack. Before the switching is done, + * %rax is the copied pt_regs pointer in IST main stack. + */ + movq %rax, %rsp + + /* + * The label should be immediate after the instruction that switches + * the stack since there is code assuming there is only one single + * instruction in the commit stage and the code assumes "%rsp in the + * IST main stack is also the sign of ending a atomic-IST-entry". + * (The code will be removed in future when %rip-based identifying + * is added.) + */ +SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL) + + /* + * Now, it is on the IST main stack. For the whole kernel, the entries + * of the IST exceptions can be seen from here because the inside + * of the atomic-IST-entry can not be seen from the whole kernel + * except in the atomic-IST-entry or #DF. + */ + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + + /* + * The code setting ORIG_RAX will not be replicated if interrupted. + * So redo it here. + */ + .if \has_error_code == 0 + movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */ + .endif + + /* + * If the entry is from userspace, switch stacks and treat it as + * a normal entry. + */ + testb $3, CS(%rsp) + jnz .Lfrom_usermode_switch_stack_\@ + + /* + * paranoid_entry returns GS/CR3/SPEC_CTL information for + * paranoid_exit in RBX/R14/R15. + */ + call paranoid_entry + + movq %rsp, %rdi /* pt_regs pointer */ + .if \has_error_code == 1 + movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/ + movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call \cfunc + + jmp paranoid_exit + +.Lfrom_usermode_switch_stack_\@: + /* Switch context: GS_BASE, CR3, SPEC_CTL. */ + swapgs + FENCE_SWAPGS_USER_ENTRY + /* We have user CR3. Change to kernel CR3. */ + SWITCH_TO_KERNEL_CR3 scratch_reg=%rax + IBRS_ENTER + UNTRAIN_RET + + /* Put the pt_regs onto the kernel task stack. */ + movq %rsp, %rdi /* arg0 = pt_regs pointer */ + call sync_regs + + /* + * Switch to the kernel task stack and use the user entry point. + * + * When from the user mode, the procedure has to atomically switches + * off the TSS-configured IST stacks too, so it switches to the IST + * main stack first, and then switches off the IST main stack in atomic + * fashion: when %rsp leaves the IST main stack, the IST main stack is + * totally free. + */ + movq %rax, %rsp + UNWIND_HINT_REGS + ENCODE_FRAME_POINTER + + movq %rsp, %rdi /* pt_regs pointer into 1st argument*/ + .if \has_error_code == 1 + movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/ + movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call \user_cfunc + + /* For some configurations \user_cfunc ends up being a noreturn. 
*/ + REACHABLE + + jmp error_return + +_ASM_NOKPROBE(\asmsym) +SYM_CODE_END(\asmsym) +.endm + /** * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB * @vector: Vector number @@ -586,8 +764,23 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_df vector asmsym cfunc SYM_CODE_START(\asmsym) + + /* + * This unwind-hint is incorect if it is the soft double fault rasied + * from ist_double_fault(). It doesn't matter since it is unrecoverable + * double fault. + */ UNWIND_HINT_IRET_REGS offset=8 ENDBR + + /* + * Set %rsp = %rsp - 8 if it is the soft double fault raisied from + * ist_double_fault(). The CPU doesn't push an error code in the case + * since it is injected by an INT instruction. + */ + btr $3, %rsp + UNWIND_HINT_IRET_REGS offset=8 + ASM_CLAC cld diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c new file mode 100644 index 000000000000..e1b06306ac51 --- /dev/null +++ b/arch/x86/entry/ist_entry.c @@ -0,0 +1,299 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2022-2023 Lai Jiangshan, Ant Group + * + * Handle entries and exits for hardware traps and faults. + * + * It is as low level as entry_64.S and its code can be running in the + * environments that the GS base is a user controlled value, or the CR3 + * is the PTI user CR3 or both. + */ +#include + +#define IST_DOUBLE_FAULT_VECTOR 8 + +static __always_inline void ist_double_fault(void) +{ + asm volatile ("int $" __stringify(IST_DOUBLE_FAULT_VECTOR)); +} + +#define IN_CEA_ESTACK(ceastp, name, sp) \ + ((CEA_ESTACK_BOT(ceastp, name) <= (sp)) && \ + ((sp) < CEA_ESTACK_TOP(ceastp, name))) + +struct ist_ctx { + const struct pt_regs *regs; + unsigned long commit_ip; +}; + +#define DEFINE_IDENTIFY_IST(stack_name, sym_name, is_enabled) \ +extern char commit_asm_exc_##sym_name[]; \ +static __always_inline bool identify_ist_##sym_name( \ + const struct pt_regs *regs, struct cea_exception_stacks *stacks,\ + struct ist_ctx *ctx) \ +{ \ + if (!(is_enabled)) \ + return false; \ + if (!IN_CEA_ESTACK(stacks, stack_name, regs->sp)) \ + return false; \ + ctx->regs = (struct pt_regs *)CEA_ESTACK_TOP(stacks, stack_name) - 1; \ + ctx->commit_ip = (unsigned long)commit_asm_exc_##sym_name; \ + return true; \ +} + +DEFINE_IDENTIFY_IST(NMI, nmi, false) +DEFINE_IDENTIFY_IST(DB, debug, false) +DEFINE_IDENTIFY_IST(MCE, machine_check, false) +DEFINE_IDENTIFY_IST(VC, vmm_communication, false) + +static __always_inline bool identify_ist( + const struct pt_regs *regs, struct cea_exception_stacks *stacks, + struct ist_ctx *ctx) +{ + return identify_ist_nmi(regs, stacks, ctx) || + identify_ist_debug(regs, stacks, ctx) || + identify_ist_machine_check(regs, stacks, ctx) || + identify_ist_vmm_communication(regs, stacks, ctx); +} + +/* + * identify if an interrupted atomic-IST-entry had successfully saved + * the general registers onto its IST stack. + * + * Generally, the outmost atomic-IST-entry had likely successfully saved + * the general registers. If not, there must be one of the nested + * atomic-IST-entry had saved the general registers of the context that + * the outmost atomic-IST-entry had interrupted. + * + * Arguments: + * @nested: the nested atomic-IST-entry who had interrupted @interrupted + * @interrupted: the interrupted atomic-IST-entry. + * + * Returns: + * true: the interrupted atomic-IST-entry had saved the general registers. + * false: the interrupted atomic-IST-entry had not yet saved the general registers. 
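+ *
+ * The check relies on @interrupted pointing at the pt_regs area at the
+ * top of the interrupted entry's own IST stack: if the nested event hit
+ * only after the interrupted entry's PUSH_REGS had completed, the saved
+ * @nested->sp has already descended to (or below) the start of that
+ * pt_regs area, and the general registers stored there are valid.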
+ */ +static __always_inline +bool identify_if_gp_registers_saved(const struct pt_regs *nested, const struct pt_regs *interrupted) +{ + return nested->sp <= (unsigned long)(void *)interrupted; +} + +static __always_inline +void copy_regs_exception_head(struct pt_regs *target, const struct pt_regs *from) +{ + target->ss = from->ss; + target->sp = from->sp; + target->flags = from->flags; + target->cs = from->cs; + target->ip = from->ip; + target->orig_ax = from->orig_ax; +} + +static __always_inline +void copy_regs_general_registers(struct pt_regs *target, const struct pt_regs *from) +{ + target->di = from->di; + target->si = from->si; + target->dx = from->dx; + target->cx = from->cx; + target->ax = from->ax; + target->r8 = from->r8; + target->r9 = from->r9; + target->r10 = from->r10; + target->r11 = from->r11; + target->bx = from->bx; + target->bp = from->bp; + target->r12 = from->r12; + target->r13 = from->r13; + target->r14 = from->r14; + target->r15 = from->r15; +} + +/* + * Do the work as the outmost atomic-IST-entry to copy the supposed pt_regs + * of the interrupted context to the IST main stack. (If the ongoing + * atomic-IST-entry is the outmost one, the work is literally doing copy as + * the outmost, if not, the work is replicating the outmost.) + * + * The hardware-entry of the outmost atomic-IST-entry has already saved the + * exception head of the pt_regs. If the outmost atomic-IST-entry was + * unfortunately interrupted before fully saving all the general registers, + * the general registers are untouched and must be saved by one of the + * consequent nested atomic-IST-entries. The identifying code can just + * examine all the nested atomic-IST-entries to find which one has saved + * the general registers. + */ +static __always_inline +void copy_outmost(struct pt_regs *target, const struct pt_regs *outmost, const struct pt_regs *gp) +{ + copy_regs_exception_head(target, outmost); + copy_regs_general_registers(target, gp); +} + +/* + * Replicate the interrupted atomic-IST-entry's CLAC and CLD in the ASM + * code. Even SMAP is not enabled, CLAC is replicated unconditionally + * since doing so has no harm. + */ +static __always_inline void replicate_clac_cld(struct pt_regs *target) +{ + target->flags &= ~(unsigned long)(X86_EFLAGS_AC | X86_EFLAGS_DF); +} + +/* Replicate the interrupted atomic-IST-entry's CLEAR_REGS macro. */ +static __always_inline void replicate_clear_regs(struct pt_regs *target) +{ + target->di = 0; + target->si = 0; + target->dx = 0; + target->cx = 0; + target->ax = 0; + target->r8 = 0; + target->r9 = 0; + target->r10 = 0; + target->r11 = 0; + target->bx = 0; + target->bp = 0; + target->r12 = 0; + target->r13 = 0; + target->r14 = 0; + target->r15 = 0; +} + +/* + * Replicate the action that the interrupted atomic-IST-entry's + * ist_copy_regs_to_main_stack() clobbers caller-saved registers + */ +static __always_inline void replicate_func_clobber(struct pt_regs *target) +{ + /* nothing needs to be done. */ +} + +/* + * Replicate the copy operation in the interrupted atomic-IST-entry's + * ist_copy_regs_to_main_stack() + */ +static __always_inline void replicate_func_copy(struct pt_regs *target) +{ + /* + * To avoid recursive functions calls with __always_inline, the + * copy operation for the interrupted atomic-IST-entry has been + * done in the caller of copy_nested(). Nothing need to be done. 
+ */ +} + +#define IST_FRAME_SIZE ALIGN(sizeof(struct pt_regs), 16) + +/* + * Replicate the return result of the interrupted atomic-IST-entry's + * ist_copy_regs_to_main_stack() in %rax and the commit operation. + */ +static __always_inline void replicate_func_result_and_commit(struct pt_regs *target, unsigned long commit_ip) +{ + void *target_of_interrupted = (void *)target + IST_FRAME_SIZE; + + /* return result in %rax */ + target->ax = (unsigned long)target_of_interrupted; + /* move %rax, %rsp */ + target->sp = (unsigned long)target_of_interrupted; + /* the %rip advances to commit point */ + target->ip = commit_ip; +} + +/* + * Do the work as a nested atomic-IST-entry to copy the supposed pt_regs + * of the interrupted context to the IST main stack. + * + * The hardware-entry of the nested atomic-IST-entry has already saved + * the exception head of the pt_regs of the interrupted context (inside + * the interrupted atomic-IST-entry). To maintain the atomic attribute + * of the atomic-IST-entry, the copy_nested() (of the ongoing nested + * atomic-IST-entry) has to replicate all that the interrupted + * atomic-IST-entries should have been done till the commit point and + * copy the supposed saved context (pt_regs). + * + * To avoid touching any saved pt_regs, the replicating is actually + * directly applied on the target pt_regs. + */ +static __always_inline +void copy_nested(struct pt_regs *target, const struct pt_regs *nested, unsigned long commit_ip) +{ + copy_regs_exception_head(target, nested); + replicate_clac_cld(target); + replicate_clear_regs(target); + replicate_func_clobber(target); + replicate_func_copy(target); + replicate_func_result_and_commit(target, commit_ip); +} + +asmlinkage __visible __noinstr_section(".entry.text") +struct pt_regs *ist_copy_regs_to_main_stack( + const struct pt_regs *regs, struct cea_exception_stacks *stacks) +{ + unsigned long ist_main_sp = CEA_ESTACK_TOP(stacks, IST); + struct ist_ctx ist_ctx[8]; + const struct pt_regs *gp_saved; + struct pt_regs *target; + int nr_entries, i; + + /* + * Identify all of the atomic-IST-entris. + * + * The current ongoing atomic-IST-entry doesn't need to be identified, + * but is also put in the @ist_ctx[0] for later convenience. + * + * The for-loop identifies what the context @regs has interrupted is. + * It travels back to the outmost atomic-IST-entry. + * + * Result: + * Identified result is put in ist_ctx[i]. + * ist_ctx[0] is the current ongoing atomic-IST-entry. + * ist_ctx[nr_entries-1] is the outmost atomic-IST-entry. + * gp_saved is the atomic-IST-entry that has saved the general registers. + */ + ist_ctx[0].regs = regs; + ist_ctx[0].commit_ip = -1; /* unused */ + nr_entries = 1; + gp_saved = regs; + for (;;) { + if (user_mode((struct pt_regs *)regs)) + break; + if (ip_within_syscall_gap((struct pt_regs *)regs)) + break; + if (!identify_ist(regs, stacks, &ist_ctx[nr_entries])) { + /* locate the top of copying target pt_regs */ + if (IN_CEA_ESTACK(stacks, IST, regs->sp)) + ist_main_sp = ALIGN_DOWN(regs->sp, 16); + break; + } + if (identify_if_gp_registers_saved(regs, ist_ctx[nr_entries].regs)) + gp_saved = ist_ctx[nr_entries].regs; + regs = ist_ctx[nr_entries].regs; + nr_entries++; + if (nr_entries >= ARRAY_SIZE(ist_ctx)) + ist_double_fault(); + } + + if (!IN_CEA_ESTACK(stacks, IST, ist_main_sp - IST_FRAME_SIZE * nr_entries)) + ist_double_fault(); + + /* + * Copy the saved pt_regs to the IST main stack. 
+ * + * For each atomic-IST-entry including the interrupted ones and + * the current ongoing one, calls either copy_outmost() or copy_nested() + * to copy the pt_regs of what should have been saved, by replicating + * if needed, to the IST main stack. + */ + ist_main_sp -= IST_FRAME_SIZE; + target = (void *)ist_main_sp; + copy_outmost(target, ist_ctx[nr_entries - 1].regs, gp_saved); + for (i = nr_entries - 2; unlikely(i >= 0); i--) { + ist_main_sp -= IST_FRAME_SIZE; + target = (void *)ist_main_sp; + copy_nested(target, ist_ctx[i].regs, ist_ctx[i+1].commit_ip); + } + + return target; +} diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index bb65371ea9df..f861a56c0002 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -60,5 +60,12 @@ int main(void) OFFSET(FIXED_stack_canary, fixed_percpu_data, stack_canary); BLANK(); #endif + + DEFINE(CEA_stacks_NMI, offsetofend(struct cea_exception_stacks, NMI_stack)); + DEFINE(CEA_stacks_DB, offsetofend(struct cea_exception_stacks, DB_stack)); + DEFINE(CEA_stacks_MCE, offsetofend(struct cea_exception_stacks, MCE_stack)); + DEFINE(CEA_stacks_VC, offsetofend(struct cea_exception_stacks, VC_stack)); + BLANK(); + return 0; } diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c index ffea98f9064b..e756c89996d8 100644 --- a/arch/x86/kernel/callthunks.c +++ b/arch/x86/kernel/callthunks.c @@ -123,6 +123,8 @@ static bool skip_addr(void *dest) { if (dest == error_entry) return true; + if (dest == ist_copy_regs_to_main_stack) + return true; if (dest == paranoid_entry) return true; if (dest == xen_error_entry) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index f937be1afe65..8dfa627d4b41 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3998,6 +3998,11 @@ static int validate_unret(struct objtool_file *file) return warnings; } +static bool in_ist_entry(struct instruction *insn) +{ + return !strcmp(insn->sym->name, "ist_copy_regs_to_main_stack"); +} + static int validate_retpoline(struct objtool_file *file) { struct instruction *insn; @@ -4016,7 +4021,7 @@ static int validate_retpoline(struct objtool_file *file) continue; if (insn->type == INSN_RETURN) { - if (opts.rethunk) { + if (opts.rethunk && !in_ist_entry(insn)) { WARN_FUNC("'naked' return found in RETHUNK build", insn->sec, insn->offset); } else From patchwork Mon Apr 3 14:06:02 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lai Jiangshan X-Patchwork-Id: 13198227 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D3437C761AF for ; Mon, 3 Apr 2023 14:06:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232810AbjDCOGA (ORCPT ); Mon, 3 Apr 2023 10:06:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42670 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232528AbjDCOFy (ORCPT ); Mon, 3 Apr 2023 10:05:54 -0400 Received: from mail-pj1-x1032.google.com (mail-pj1-x1032.google.com [IPv6:2607:f8b0:4864:20::1032]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C998B24AF4; Mon, 3 Apr 2023 07:05:27 -0700 (PDT) Received: by mail-pj1-x1032.google.com with SMTP id mp3-20020a17090b190300b0023fcc8ce113so32629426pjb.4; Mon, 03 Apr 2023 07:05:27 -0700 (PDT) 
From: Lai Jiangshan
To: linux-kernel@vger.kernel.org
Cc: Lai Jiangshan , "H. Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Ingo Molnar , Borislav Petkov , Dave Hansen , "H.
Peter Anvin" , Kuppuswamy Sathyanarayanan Subject: [RFC PATCH 4/7] x86/entry: Use atomic-IST-entry for NMI Date: Mon, 3 Apr 2023 22:06:02 +0800 Message-Id: <20230403140605.540512-5-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com> References: <20230403140605.540512-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Lai Jiangshan Signed-off-by: Lai Jiangshan --- arch/x86/entry/entry_64.S | 373 -------------------------------- arch/x86/entry/ist_entry.c | 2 +- arch/x86/include/asm/idtentry.h | 9 +- 3 files changed, 7 insertions(+), 377 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 50a24cc83581..2bb7ab8512dc 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -1341,379 +1341,6 @@ SYM_CODE_START_LOCAL(error_return) jmp swapgs_restore_regs_and_return_to_usermode SYM_CODE_END(error_return) -/* - * Runs on exception stack. Xen PV does not go through this path at all, - * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. - */ -SYM_CODE_START(asm_exc_nmi) - UNWIND_HINT_IRET_REGS - ENDBR - - /* - * We allow breakpoints in NMIs. If a breakpoint occurs, then - * the iretq it performs will take us out of NMI context. - * This means that we can have nested NMIs where the next - * NMI is using the top of the stack of the previous NMI. We - * can't let it execute because the nested NMI will corrupt the - * stack of the previous NMI. NMI handlers are not re-entrant - * anyway. - * - * To handle this case we do the following: - * Check the a special location on the stack that contains - * a variable that is set when NMIs are executing. - * The interrupted task's stack is also checked to see if it - * is an NMI stack. - * If the variable is not set and the stack is not the NMI - * stack then: - * o Set the special variable on the stack - * o Copy the interrupt frame into an "outermost" location on the - * stack - * o Copy the interrupt frame into an "iret" location on the stack - * o Continue processing the NMI - * If the variable is set or the previous stack is the NMI stack: - * o Modify the "iret" location to jump to the repeat_nmi - * o return back to the first NMI - * - * Now on exit of the first NMI, we first clear the stack variable - * The NMI stack will tell any nested NMIs at that point that it is - * nested. Then we pop the stack normally with iret, and if there was - * a nested NMI that updated the copy interrupt stack frame, a - * jump will be made to the repeat_nmi code that will handle the second - * NMI. - * - * However, espfix prevents us from directly returning to userspace - * with a single IRET instruction. Similarly, IRET to user mode - * can fault. We therefore handle NMIs from user space like - * other IST entries. - */ - - ASM_CLAC - cld - - /* Use %rdx as our temp variable throughout */ - pushq %rdx - - testb $3, CS-RIP+8(%rsp) - jz .Lnmi_from_kernel - - /* - * NMI from user mode. We need to run on the thread stack, but we - * can't go through the normal entry paths: NMIs are masked, and - * we don't want to enable interrupts, because then we'll end - * up in an awkward situation in which IRQs are on but NMIs - * are off. - * - * We also must not push anything to the stack before switching - * stacks lest we corrupt the "NMI executing" variable. 
- */ - - swapgs - FENCE_SWAPGS_USER_ENTRY - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx - movq %rsp, %rdx - movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp - UNWIND_HINT_IRET_REGS base=%rdx offset=8 - pushq 5*8(%rdx) /* pt_regs->ss */ - pushq 4*8(%rdx) /* pt_regs->rsp */ - pushq 3*8(%rdx) /* pt_regs->flags */ - pushq 2*8(%rdx) /* pt_regs->cs */ - pushq 1*8(%rdx) /* pt_regs->rip */ - UNWIND_HINT_IRET_REGS - pushq $-1 /* pt_regs->orig_ax */ - PUSH_AND_CLEAR_REGS rdx=(%rdx) - ENCODE_FRAME_POINTER - - IBRS_ENTER - UNTRAIN_RET - - /* - * At this point we no longer need to worry about stack damage - * due to nesting -- we're on the normal thread stack and we're - * done with the NMI stack. - */ - - movq %rsp, %rdi - movq $-1, %rsi - call exc_nmi - - /* - * Return back to user mode. We must *not* do the normal exit - * work, because we don't want to enable interrupts. - */ - jmp swapgs_restore_regs_and_return_to_usermode - -.Lnmi_from_kernel: - /* - * Here's what our stack frame will look like: - * +---------------------------------------------------------+ - * | original SS | - * | original Return RSP | - * | original RFLAGS | - * | original CS | - * | original RIP | - * +---------------------------------------------------------+ - * | temp storage for rdx | - * +---------------------------------------------------------+ - * | "NMI executing" variable | - * +---------------------------------------------------------+ - * | iret SS } Copied from "outermost" frame | - * | iret Return RSP } on each loop iteration; overwritten | - * | iret RFLAGS } by a nested NMI to force another | - * | iret CS } iteration if needed. | - * | iret RIP } | - * +---------------------------------------------------------+ - * | outermost SS } initialized in first_nmi; | - * | outermost Return RSP } will not be changed before | - * | outermost RFLAGS } NMI processing is done. | - * | outermost CS } Copied to "iret" frame on each | - * | outermost RIP } iteration. | - * +---------------------------------------------------------+ - * | pt_regs | - * +---------------------------------------------------------+ - * - * The "original" frame is used by hardware. Before re-enabling - * NMIs, we need to be done with it, and we need to leave enough - * space for the asm code here. - * - * We return by executing IRET while RSP points to the "iret" frame. - * That will either return for real or it will loop back into NMI - * processing. - * - * The "outermost" frame is copied to the "iret" frame on each - * iteration of the loop, so each iteration starts with the "iret" - * frame pointing to the final return target. - */ - - /* - * Determine whether we're a nested NMI. - * - * If we interrupted kernel code between repeat_nmi and - * end_repeat_nmi, then we are a nested NMI. We must not - * modify the "iret" frame because it's being written by - * the outer NMI. That's okay; the outer NMI handler is - * about to about to call exc_nmi() anyway, so we can just - * resume the outer NMI. - */ - - movq $repeat_nmi, %rdx - cmpq 8(%rsp), %rdx - ja 1f - movq $end_repeat_nmi, %rdx - cmpq 8(%rsp), %rdx - ja nested_nmi_out -1: - - /* - * Now check "NMI executing". If it's set, then we're nested. - * This will not detect if we interrupted an outer NMI just - * before IRET. - */ - cmpl $1, -8(%rsp) - je nested_nmi - - /* - * Now test if the previous stack was an NMI stack. This covers - * the case where we interrupt an outer NMI after it clears - * "NMI executing" but before IRET. 
We need to be careful, though: - * there is one case in which RSP could point to the NMI stack - * despite there being no NMI active: naughty userspace controls - * RSP at the very beginning of the SYSCALL targets. We can - * pull a fast one on naughty userspace, though: we program - * SYSCALL to mask DF, so userspace cannot cause DF to be set - * if it controls the kernel's RSP. We set DF before we clear - * "NMI executing". - */ - lea 6*8(%rsp), %rdx - /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */ - cmpq %rdx, 4*8(%rsp) - /* If the stack pointer is above the NMI stack, this is a normal NMI */ - ja first_nmi - - subq $EXCEPTION_STKSZ, %rdx - cmpq %rdx, 4*8(%rsp) - /* If it is below the NMI stack, it is a normal NMI */ - jb first_nmi - - /* Ah, it is within the NMI stack. */ - - testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp) - jz first_nmi /* RSP was user controlled. */ - - /* This is a nested NMI. */ - -nested_nmi: - /* - * Modify the "iret" frame to point to repeat_nmi, forcing another - * iteration of NMI handling. - */ - subq $8, %rsp - leaq -10*8(%rsp), %rdx - pushq $__KERNEL_DS - pushq %rdx - pushfq - pushq $__KERNEL_CS - pushq $repeat_nmi - - /* Put stack back */ - addq $(6*8), %rsp - -nested_nmi_out: - popq %rdx - - /* We are returning to kernel mode, so this cannot result in a fault. */ - iretq - -first_nmi: - /* Restore rdx. */ - movq (%rsp), %rdx - - /* Make room for "NMI executing". */ - pushq $0 - - /* Leave room for the "iret" frame */ - subq $(5*8), %rsp - - /* Copy the "original" frame to the "outermost" frame */ - .rept 5 - pushq 11*8(%rsp) - .endr - UNWIND_HINT_IRET_REGS - - /* Everything up to here is safe from nested NMIs */ - -#ifdef CONFIG_DEBUG_ENTRY - /* - * For ease of testing, unmask NMIs right away. Disabled by - * default because IRET is very expensive. - */ - pushq $0 /* SS */ - pushq %rsp /* RSP (minus 8 because of the previous push) */ - addq $8, (%rsp) /* Fix up RSP */ - pushfq /* RFLAGS */ - pushq $__KERNEL_CS /* CS */ - pushq $1f /* RIP */ - iretq /* continues at repeat_nmi below */ - UNWIND_HINT_IRET_REGS -1: -#endif - -repeat_nmi: - ANNOTATE_NOENDBR // this code - /* - * If there was a nested NMI, the first NMI's iret will return - * here. But NMIs are still enabled and we can take another - * nested NMI. The nested NMI checks the interrupted RIP to see - * if it is between repeat_nmi and end_repeat_nmi, and if so - * it will just return, as we are about to repeat an NMI anyway. - * This makes it safe to copy to the stack frame that a nested - * NMI will update. - * - * RSP is pointing to "outermost RIP". gsbase is unknown, but, if - * we're repeating an NMI, gsbase has the same value that it had on - * the first iteration. paranoid_entry will load the kernel - * gsbase if needed before we call exc_nmi(). "NMI executing" - * is zero. - */ - movq $1, 10*8(%rsp) /* Set "NMI executing". */ - - /* - * Copy the "outermost" frame to the "iret" frame. NMIs that nest - * here must not modify the "iret" frame while we're writing to - * it or it will end up containing garbage. - */ - addq $(10*8), %rsp - .rept 5 - pushq -6*8(%rsp) - .endr - subq $(5*8), %rsp -end_repeat_nmi: - ANNOTATE_NOENDBR // this code - - /* - * Everything below this point can be preempted by a nested NMI. - * If this happens, then the inner NMI will change the "iret" - * frame to point back to repeat_nmi. 
- */ - pushq $-1 /* ORIG_RAX: no syscall to restart */ - - PUSH_AND_CLEAR_REGS - UNWIND_HINT_REGS - ENCODE_FRAME_POINTER - - /* - * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit - * as we should not be calling schedule in NMI context. - * Even with normal interrupts enabled. An NMI should not be - * setting NEED_RESCHED or anything that normal interrupts and - * exceptions might do. - */ - call paranoid_entry - - movq %rsp, %rdi - movq $-1, %rsi - call exc_nmi - - /* Always restore stashed SPEC_CTRL value (see paranoid_entry) */ - IBRS_EXIT save_reg=%r15 - - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - - /* - * The above invocation of paranoid_entry stored the GSBASE - * related information in R/EBX depending on the availability - * of FSGSBASE. - * - * If FSGSBASE is enabled, restore the saved GSBASE value - * unconditionally, otherwise take the conditional SWAPGS path. - */ - ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE - - wrgsbase %rbx - jmp nmi_restore - -nmi_no_fsgsbase: - /* EBX == 0 -> invoke SWAPGS */ - testl %ebx, %ebx - jnz nmi_restore - -nmi_swapgs: - swapgs - -nmi_restore: - POP_REGS - - /* - * Skip orig_ax and the "outermost" frame to point RSP at the "iret" - * at the "iret" frame. - */ - addq $6*8, %rsp - - /* - * Clear "NMI executing". Set DF first so that we can easily - * distinguish the remaining code between here and IRET from - * the SYSCALL entry and exit paths. - * - * We arguably should just inspect RIP instead, but I (Andy) wrote - * this code when I had the misapprehension that Xen PV supported - * NMIs, and Xen PV would break that approach. - */ - std - movq $0, 5*8(%rsp) /* clear "NMI executing" */ - - /* - * iretq reads the "iret" frame and exits the NMI stack in a - * single instruction. We are returning to kernel mode, so this - * cannot result in a fault. Similarly, we don't need to worry - * about espfix64 on the way back to kernel mode. - */ - iretq -SYM_CODE_END(asm_exc_nmi) - #ifndef CONFIG_IA32_EMULATION /* * This handles SYSCALL from 32-bit code. 
There is no way to program diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c index e1b06306ac51..407571cc4a8c 100644 --- a/arch/x86/entry/ist_entry.c +++ b/arch/x86/entry/ist_entry.c @@ -41,7 +41,7 @@ static __always_inline bool identify_ist_##sym_name( \ return true; \ } -DEFINE_IDENTIFY_IST(NMI, nmi, false) +DEFINE_IDENTIFY_IST(NMI, nmi, true) DEFINE_IDENTIFY_IST(DB, debug, false) DEFINE_IDENTIFY_IST(MCE, machine_check, false) DEFINE_IDENTIFY_IST(VC, vmm_communication, false) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b241af4ce9b4..b568f1de6da6 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -450,6 +450,9 @@ __visible noinstr void func(struct pt_regs *regs, \ idtentry_sysvec vector func #ifdef CONFIG_X86_64 +# define DECLARE_IDTENTRY_NMI(vector, func) \ + idtentry_ist vector asm_##func func func has_error_code=0 stack_offset=CEA_stacks_NMI + # define DECLARE_IDTENTRY_MCE(vector, func) \ idtentry_mce_db vector asm_##func func @@ -475,11 +478,11 @@ __visible noinstr void func(struct pt_regs *regs, \ /* No ASM emitted for XEN hypervisor callback */ # define DECLARE_IDTENTRY_XENCB(vector, func) -#endif - -/* No ASM code emitted for NMI */ +/* No ASM code emitted for NMI for X86_32 */ #define DECLARE_IDTENTRY_NMI(vector, func) +#endif + /* * ASM code to emit the common vector entry stubs where each stub is * packed into IDT_ALIGN bytes. From patchwork Mon Apr 3 14:06:03 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lai Jiangshan X-Patchwork-Id: 13198228 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 94074C76196 for ; Mon, 3 Apr 2023 14:06:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232797AbjDCOGR (ORCPT ); Mon, 3 Apr 2023 10:06:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232822AbjDCOGC (ORCPT ); Mon, 3 Apr 2023 10:06:02 -0400 Received: from mail-pj1-x1035.google.com (mail-pj1-x1035.google.com [IPv6:2607:f8b0:4864:20::1035]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F413E40DF; Mon, 3 Apr 2023 07:05:32 -0700 (PDT) Received: by mail-pj1-x1035.google.com with SMTP id o6-20020a17090a9f8600b0023f32869993so32643515pjp.1; Mon, 03 Apr 2023 07:05:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1680530731; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tn2wKwx0EXmrg4KUjii9CLG1/m9PFl3nKLSeijJ1CpQ=; b=bpKyUkA1BBY1Z8ckFIx6sm8b6cCHlPM87VXGt5UEhc2HjF2VVp1oJCzKyVvcA3lOAo 17R5cyzdXHXyhp0+HVocGg6LkmttssT7gvSXdAgjGl36dq8ejzCOrDYjKs9yCWjKZ/nU N4+PTIFOPadNrsJO2ZDeNJ/tMy3LO/8xYGHR9NhvpLOdo04i5UrCDXlc4RjhUuPIYH4O smVPF7cWsawXXpUnRln2PoRwwBbbSFvlOeq1PxciVcAhAE2BkVj1PKsvhKxk06JUI8jL 5stDFNzAv80qVPEEJnx2w2qMP0gm6ZQYEam5LzVdE4ATvyP7KWrbaFn8AJNejem8o2bI ucQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680530731; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; 
bh=tn2wKwx0EXmrg4KUjii9CLG1/m9PFl3nKLSeijJ1CpQ=; b=PowNg5mS3zQD54Y0gG3byy97C0m4v5wBPhYIOFcTWy2nO6t85mCVNX5ZBW01e5311B 2oZWM0Wk3ZQ+OIkyH35/2ar3uRPqOp71/X0ZwaEOKUcPx2xLzNr5Q80Qi5r8sSIoXWGu uFdp18j/QYgZPQAx/ljcTZYICopUEszVKkBvg2GE44c/O6KCV4CBWA4VzFkENI3pobAi OPD/IwCkgZyWWdbVX+p7mzkXGMQrQMu/fwH0SkF3n8Uqgp6LfC8YkWZYxDVFPOt9Bxoy i/j5TOAvLFNQfBDzBXIPxO0ltjoXUmccBGd916X9KtYSvxpXwiBrV83RgJQ1IvjnD7aa GOag== X-Gm-Message-State: AAQBX9eCSQ6zILdg/A1N6mOIwbr6PEbnHwKEXKhHxnU6ii1sFazVc97J 4iI6s9RgyIk3QRxVCZRltPlcNsToQpU= X-Google-Smtp-Source: AKy350YMy6MWeagJFudQRTfciSmnNqn7Ihy86w3MRx93dio4hrdleDz+fZZhiUbRi9zPHWP23WD/Qw== X-Received: by 2002:a17:903:2312:b0:1a1:a6e5:764b with SMTP id d18-20020a170903231200b001a1a6e5764bmr48814320plh.60.1680530731648; Mon, 03 Apr 2023 07:05:31 -0700 (PDT) Received: from localhost ([47.88.5.130]) by smtp.gmail.com with ESMTPSA id f19-20020a170902e99300b0019acd3151d0sm6649416plb.114.2023.04.03.07.05.30 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 03 Apr 2023 07:05:31 -0700 (PDT) From: Lai Jiangshan To: linux-kernel@vger.kernel.org Cc: Lai Jiangshan , "H. Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Kuppuswamy Sathyanarayanan Subject: [RFC PATCH 5/7] x86/entry: Use atomic-IST-entry for MCE and DB Date: Mon, 3 Apr 2023 22:06:03 +0800 Message-Id: <20230403140605.540512-6-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com> References: <20230403140605.540512-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Lai Jiangshan Signed-off-by: Lai Jiangshan --- arch/x86/entry/entry_64.S | 53 --------------------------------- arch/x86/entry/ist_entry.c | 4 +-- arch/x86/include/asm/idtentry.h | 4 +-- 3 files changed, 4 insertions(+), 57 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 2bb7ab8512dc..e4ddc793f841 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -621,59 +621,6 @@ _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm -/** - * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB - * @vector: Vector number - * @asmsym: ASM symbol for the entry point - * @cfunc: C function to be called - * - * The macro emits code to set up the kernel context for #MC and #DB - * - * If the entry comes from user space it uses the normal entry path - * including the return to user space work and preemption checks on - * exit. - * - * If hits in kernel mode then it needs to go through the paranoid - * entry as the exception can hit any random state. No preemption - * check on exit to keep the paranoid path simple. - */ -.macro idtentry_mce_db vector asmsym cfunc -SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ENDBR - ASM_CLAC - cld - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - - /* - * If the entry is from userspace, switch stacks and treat it as - * a normal entry. 
- */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ - - PUSH_AND_CLEAR_REGS - UNWIND_HINT_REGS - ENCODE_FRAME_POINTER - - /* paranoid_entry returns GS information for paranoid_exit in EBX. */ - call paranoid_entry - - movq %rsp, %rdi /* pt_regs pointer */ - - call \cfunc - - jmp paranoid_exit - - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - -_ASM_NOKPROBE(\asmsym) -SYM_CODE_END(\asmsym) -.endm - #ifdef CONFIG_AMD_MEM_ENCRYPT /** * idtentry_vc - Macro to generate entry stub for #VC diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c index 407571cc4a8c..946b3b537bd5 100644 --- a/arch/x86/entry/ist_entry.c +++ b/arch/x86/entry/ist_entry.c @@ -42,8 +42,8 @@ static __always_inline bool identify_ist_##sym_name( \ } DEFINE_IDENTIFY_IST(NMI, nmi, true) -DEFINE_IDENTIFY_IST(DB, debug, false) -DEFINE_IDENTIFY_IST(MCE, machine_check, false) +DEFINE_IDENTIFY_IST(DB, debug, true) +DEFINE_IDENTIFY_IST(MCE, machine_check, IS_ENABLED(CONFIG_X86_MCE)) DEFINE_IDENTIFY_IST(VC, vmm_communication, false) static __always_inline bool identify_ist( diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b568f1de6da6..01f3152ffe82 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -454,10 +454,10 @@ __visible noinstr void func(struct pt_regs *regs, \ idtentry_ist vector asm_##func func func has_error_code=0 stack_offset=CEA_stacks_NMI # define DECLARE_IDTENTRY_MCE(vector, func) \ - idtentry_mce_db vector asm_##func func + idtentry_ist vector asm_##func func noist_##func has_error_code=0 stack_offset=CEA_stacks_MCE # define DECLARE_IDTENTRY_DEBUG(vector, func) \ - idtentry_mce_db vector asm_##func func + idtentry_ist vector asm_##func func noist_##func has_error_code=0 stack_offset=CEA_stacks_DB # define DECLARE_IDTENTRY_DF(vector, func) \ idtentry_df vector asm_##func func From patchwork Mon Apr 3 14:06:04 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lai Jiangshan X-Patchwork-Id: 13198229 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AB054C761A6 for ; Mon, 3 Apr 2023 14:06:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232878AbjDCOGp (ORCPT ); Mon, 3 Apr 2023 10:06:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43310 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232578AbjDCOGX (ORCPT ); Mon, 3 Apr 2023 10:06:23 -0400 Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com [IPv6:2607:f8b0:4864:20::102e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C514D2BECB; Mon, 3 Apr 2023 07:05:52 -0700 (PDT) Received: by mail-pj1-x102e.google.com with SMTP id j13so27313926pjd.1; Mon, 03 Apr 2023 07:05:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1680530747; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tkuw81I/aX6b/59j6Co6HUBjcgFUTGOqJSCJO2DhDcg=; b=WuEa5G/4lBckHk7nq/DPvThZPQB4AtqBwCaNe03LCce8Z0MhcMozEMEJLbD+Bo0dYi C/jTla1EEkMqkgYTgh7HzYYFNAkE2z8CQTF35/TZWQd20OzBgLLGjXgNyotvzOAa/Iou 
bdJO/M+EoYeMKzv9QkqTrPskMX+bvj3VcDyMb9LlEvy0kuafQqENXnL8zPh5zy9WKByb cBwzZc8YL2YQOJeAGueAc4aIWgH+xYK7omuyQbKNiga7Ok0+WxW1qDFoHqvZR5Ybr2Vz eRURMUSlR9vXzXp/TEfFYh8ydoD17LIENSsj3X5HWVbPkfxLA7AC5DjyGNIIO+gJThlE Mclw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680530747; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tkuw81I/aX6b/59j6Co6HUBjcgFUTGOqJSCJO2DhDcg=; b=ZVVaPzh2q/xyvV7weFvHhMreBJv/WhbVAmp+Cwt9cRy/nn5s42S+s3rbRvFmUekQ33 n2wl5X9AwOGqWg+XXPlG6QW5PcF94S4EZfG1ag6+sO+NZMZ4oE+3Hl0DUVZla3PyUyek zVEGJgwQjSMuc2XIWgoj7zZAck3OaKm6j48QvCUQiSPvRD0QXFqc25Mojk+RYjQDV/r4 +fqdTgcBxsLgGLdVCr4jovdHWOfIyiKlA7C+/+1zWM1UG+6ONV6J55x7Xea40LpBavcL zXdonTguIuyWIxH96Uj8n/xXCeqHsD7NXFH8Kkcdd3mmJG6wsZjbY/mfbOWBad29Bofn hrWg== X-Gm-Message-State: AAQBX9eLAxUeUMFN+S0664MJ7Gh9NSoopUPvbAdjIBrftLcoZV2lIRhL vUNUycQp+gmKwbIr5o8wsWb/NLZ5IM4= X-Google-Smtp-Source: AKy350Z0p2YSQMmwfxiGhhWsJRHhupcmkaKVxEfihpzfkjQKsa/LZdMbwlGjQUyc15H7/2/cg07nYA== X-Received: by 2002:a17:903:807:b0:1a2:a298:6f65 with SMTP id kr7-20020a170903080700b001a2a2986f65mr10187229plb.2.1680530746444; Mon, 03 Apr 2023 07:05:46 -0700 (PDT) Received: from localhost ([198.11.178.15]) by smtp.gmail.com with ESMTPSA id bf12-20020a170902b90c00b00198fb25d09bsm6609339plb.237.2023.04.03.07.05.45 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 03 Apr 2023 07:05:46 -0700 (PDT) From: Lai Jiangshan To: linux-kernel@vger.kernel.org Cc: Lai Jiangshan , "H. Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" , Kuppuswamy Sathyanarayanan , Brijesh Singh , Michael Roth , Venu Busireddy , "Paul E. McKenney" , "Jason A. 
Donenfeld" , Sami Tolvanen , Alexander Potapenko , Jiapeng Chong Subject: [RFC PATCH 6/7] x86/entry: Use atomic-IST-entry for VC Date: Mon, 3 Apr 2023 22:06:04 +0800 Message-Id: <20230403140605.540512-7-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com> References: <20230403140605.540512-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Lai Jiangshan Signed-off-by: Lai Jiangshan --- arch/x86/entry/entry_64.S | 83 -------------------- arch/x86/entry/ist_entry.c | 2 +- arch/x86/include/asm/cpu_entry_area.h | 3 - arch/x86/include/asm/idtentry.h | 2 +- arch/x86/include/asm/sev.h | 14 ---- arch/x86/include/asm/traps.h | 1 - arch/x86/kernel/dumpstack_64.c | 4 +- arch/x86/kernel/nmi.c | 8 -- arch/x86/kernel/sev.c | 108 -------------------------- arch/x86/kernel/traps.c | 43 ---------- arch/x86/mm/cpu_entry_area.c | 1 - 11 files changed, 3 insertions(+), 266 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index e4ddc793f841..187d42efd288 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -621,89 +621,6 @@ _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm -#ifdef CONFIG_AMD_MEM_ENCRYPT -/** - * idtentry_vc - Macro to generate entry stub for #VC - * @vector: Vector number - * @asmsym: ASM symbol for the entry point - * @cfunc: C function to be called - * - * The macro emits code to set up the kernel context for #VC. The #VC handler - * runs on an IST stack and needs to be able to cause nested #VC exceptions. - * - * To make this work the #VC entry code tries its best to pretend it doesn't use - * an IST stack by switching to the task stack if coming from user-space (which - * includes early SYSCALL entry path) or back to the stack in the IRET frame if - * entered from kernel-mode. - * - * If entered from kernel-mode the return stack is validated first, and if it is - * not safe to use (e.g. because it points to the entry stack) the #VC handler - * will switch to a fall-back stack (VC2) and call a special handler function. - * - * The macro is only used for one vector, but it is planned to be extended in - * the future for the #HV exception. - */ -.macro idtentry_vc vector asmsym cfunc -SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ENDBR - ASM_CLAC - cld - - /* - * If the entry is from userspace, switch stacks and treat it as - * a normal entry. - */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ - - PUSH_AND_CLEAR_REGS - UNWIND_HINT_REGS - ENCODE_FRAME_POINTER - - /* - * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. - * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS - */ - call paranoid_entry - - /* - * Switch off the IST stack to make it free for nested exceptions. The - * vc_switch_off_ist() function will switch back to the interrupted - * stack if it is safe to do so. If not it switches to the VC fall-back - * stack. - */ - movq %rsp, %rdi /* pt_regs pointer */ - call vc_switch_off_ist - movq %rax, %rsp /* Switch to new stack */ - - ENCODE_FRAME_POINTER - UNWIND_HINT_REGS - - /* Update pt_regs */ - movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/ - movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */ - - movq %rsp, %rdi /* pt_regs pointer */ - - call kernel_\cfunc - - /* - * No need to switch back to the IST stack. 
The current stack is either - * identical to the stack in the IRET frame or the VC fall-back stack, - * so it is definitely mapped even with PTI enabled. - */ - jmp paranoid_exit - - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body user_\cfunc, has_error_code=1 - -_ASM_NOKPROBE(\asmsym) -SYM_CODE_END(\asmsym) -.endm -#endif - /* * Double fault entry. Straight paranoid. No checks from which context * this comes because for the espfix induced #DF this would do the wrong diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c index 946b3b537bd5..c0cbd4527033 100644 --- a/arch/x86/entry/ist_entry.c +++ b/arch/x86/entry/ist_entry.c @@ -44,7 +44,7 @@ static __always_inline bool identify_ist_##sym_name( \ DEFINE_IDENTIFY_IST(NMI, nmi, true) DEFINE_IDENTIFY_IST(DB, debug, true) DEFINE_IDENTIFY_IST(MCE, machine_check, IS_ENABLED(CONFIG_X86_MCE)) -DEFINE_IDENTIFY_IST(VC, vmm_communication, false) +DEFINE_IDENTIFY_IST(VC, vmm_communication, IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) static __always_inline bool identify_ist( const struct pt_regs *regs, struct cea_exception_stacks *stacks, diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h index a373e8c37e25..618aa698eb82 100644 --- a/arch/x86/include/asm/cpu_entry_area.h +++ b/arch/x86/include/asm/cpu_entry_area.h @@ -30,8 +30,6 @@ char MCE_stack[EXCEPTION_STKSZ]; \ char VC_stack_guard[guardsize]; \ char VC_stack[optional_stack_size]; \ - char VC2_stack_guard[guardsize]; \ - char VC2_stack[optional_stack_size]; \ char IST_stack_guard[guardsize]; \ char IST_stack[IST_MAIN_STKSZ]; \ char IST_top_guard[guardsize]; \ @@ -55,7 +53,6 @@ enum exception_stack_ordering { ESTACK_DB, ESTACK_MCE, ESTACK_VC, - ESTACK_VC2, ESTACK_IST, N_EXCEPTION_STACKS }; diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 01f3152ffe82..5f3250e589ec 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -466,7 +466,7 @@ __visible noinstr void func(struct pt_regs *regs, \ DECLARE_IDTENTRY(vector, func) # define DECLARE_IDTENTRY_VC(vector, func) \ - idtentry_vc vector asm_##func func + idtentry_ist vector asm_##func kernel_##func user_##func has_error_code=1 stack_offset=CEA_stacks_VC #else # define DECLARE_IDTENTRY_MCE(vector, func) \ diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h index ebc271bb6d8e..ce554b3a818d 100644 --- a/arch/x86/include/asm/sev.h +++ b/arch/x86/include/asm/sev.h @@ -135,18 +135,6 @@ struct snp_secrets_page_layout { #ifdef CONFIG_AMD_MEM_ENCRYPT extern struct static_key_false sev_es_enable_key; -extern void __sev_es_ist_enter(struct pt_regs *regs); -extern void __sev_es_ist_exit(void); -static __always_inline void sev_es_ist_enter(struct pt_regs *regs) -{ - if (static_branch_unlikely(&sev_es_enable_key)) - __sev_es_ist_enter(regs); -} -static __always_inline void sev_es_ist_exit(void) -{ - if (static_branch_unlikely(&sev_es_enable_key)) - __sev_es_ist_exit(); -} extern int sev_es_setup_ap_jump_table(struct real_mode_header *rmh); extern void __sev_es_nmi_complete(void); static __always_inline void sev_es_nmi_complete(void) @@ -198,8 +186,6 @@ bool snp_init(struct boot_params *bp); void __init __noreturn snp_abort(void); int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err); #else -static inline void sev_es_ist_enter(struct pt_regs *regs) { } -static inline void sev_es_ist_exit(void) { } static inline int sev_es_setup_ap_jump_table(struct 
real_mode_header *rmh) { return 0; } static inline void sev_es_nmi_complete(void) { } static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; } diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 47ecfff2c83d..dc0da530f951 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,6 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); #endif extern bool ibt_selftest(void); diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c index 3413b23fa9f1..b7ef2685f63b 100644 --- a/arch/x86/kernel/dumpstack_64.c +++ b/arch/x86/kernel/dumpstack_64.c @@ -25,7 +25,6 @@ static const char * const exception_stack_names[] = { [ ESTACK_DB ] = "#DB", [ ESTACK_MCE ] = "#MC", [ ESTACK_VC ] = "#VC", - [ ESTACK_VC2 ] = "#VC2", [ ESTACK_IST ] = "#IST", }; @@ -89,7 +88,6 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = { EPAGERANGE(DB), EPAGERANGE(MCE), EPAGERANGE(VC), - EPAGERANGE(VC2), EPAGERANGE(IST), }; @@ -100,7 +98,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac struct pt_regs *regs; unsigned int k; - BUILD_BUG_ON(N_EXCEPTION_STACKS != 7); + BUILD_BUG_ON(N_EXCEPTION_STACKS != 6); begin = (unsigned long)__this_cpu_read(cea_exception_stacks); /* diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 776f4b1e395b..bafd0c7ca5b7 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -514,12 +514,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi) } nmi_restart: - /* - * Needs to happen before DR7 is accessed, because the hypervisor can - * intercept DR7 reads/writes, turning those into #VC exceptions. - */ - sev_es_ist_enter(regs); - this_cpu_write(nmi_dr7, local_db_save()); irq_state = irqentry_nmi_enter(regs); @@ -544,8 +538,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi) local_db_restore(this_cpu_read(nmi_dr7)); - sev_es_ist_exit(); - if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) write_cr2(this_cpu_read(nmi_cr2)); if (this_cpu_dec_return(nmi_state)) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 679026a640ef..74d55786c353 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -122,77 +122,6 @@ struct sev_config { static struct sev_config sev_cfg __read_mostly; -static __always_inline bool on_vc_stack(struct pt_regs *regs) -{ - unsigned long sp = regs->sp; - - /* User-mode RSP is not trusted */ - if (user_mode(regs)) - return false; - - /* SYSCALL gap still has user-mode RSP */ - if (ip_within_syscall_gap(regs)) - return false; - - return ((sp >= __this_cpu_ist_bottom_va(VC)) && (sp < __this_cpu_ist_top_va(VC))); -} - -/* - * This function handles the case when an NMI is raised in the #VC - * exception handler entry code, before the #VC handler has switched off - * its IST stack. In this case, the IST entry for #VC must be adjusted, - * so that any nested #VC exception will not overwrite the stack - * contents of the interrupted #VC handler. - * - * The IST entry is adjusted unconditionally so that it can be also be - * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a - * nested sev_es_ist_exit() call may adjust back the IST entry too - * early. 
- * - * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run - * on the NMI IST stack, as they are only called from NMI handling code - * right now. - */ -void noinstr __sev_es_ist_enter(struct pt_regs *regs) -{ - unsigned long old_ist, new_ist; - - /* Read old IST entry */ - new_ist = old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]); - - /* - * If NMI happened while on the #VC IST stack, set the new IST - * value below regs->sp, so that the interrupted stack frame is - * not overwritten by subsequent #VC exceptions. - */ - if (on_vc_stack(regs)) - new_ist = regs->sp; - - /* - * Reserve additional 8 bytes and store old IST value so this - * adjustment can be unrolled in __sev_es_ist_exit(). - */ - new_ist -= sizeof(old_ist); - *(unsigned long *)new_ist = old_ist; - - /* Set new IST entry */ - this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], new_ist); -} - -void noinstr __sev_es_ist_exit(void) -{ - unsigned long ist; - - /* Read IST entry */ - ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]); - - if (WARN_ON(ist == __this_cpu_ist_top_va(VC))) - return; - - /* Read back old IST entry and write it to the TSS */ - this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], *(unsigned long *)ist); -} - /* * Nothing shall interrupt this code path while holding the per-CPU * GHCB. The backup GHCB is only for NMIs interrupting this path. @@ -1841,26 +1770,6 @@ static __always_inline void vc_forward_exception(struct es_em_ctxt *ctxt) } } -static __always_inline bool is_vc2_stack(unsigned long sp) -{ - return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2)); -} - -static __always_inline bool vc_from_invalid_context(struct pt_regs *regs) -{ - unsigned long sp, prev_sp; - - sp = (unsigned long)regs; - prev_sp = regs->sp; - - /* - * If the code was already executing on the VC2 stack when the #VC - * happened, let it proceed to the normal handling routine. This way the - * code executing on the VC2 stack can cause #VC exceptions to get handled. - */ - return is_vc2_stack(sp) && !is_vc2_stack(prev_sp); -} - static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code) { struct ghcb_state state; @@ -1930,23 +1839,6 @@ DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication) { irqentry_state_t irq_state; - /* - * With the current implementation it is always possible to switch to a - * safe stack because #VC exceptions only happen at known places, like - * intercepted instructions or accesses to MMIO areas/IO ports. They can - * also happen with code instrumentation when the hypervisor intercepts - * #DB, but the critical paths are forbidden to be instrumented, so #DB - * exceptions currently also only happen in safe places. - * - * But keep this here in case the noinstr annotations are violated due - * to bug elsewhere. - */ - if (unlikely(vc_from_invalid_context(regs))) { - instrumentation_begin(); - panic("Can't handle #VC exception from unsupported context\n"); - instrumentation_end(); - } - /* * Handle #DB before calling into !noinstr code to avoid recursive #DB. 
*/ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index d317dc3d06a3..6c697c175f7a 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -864,49 +864,6 @@ asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs) return regs; } -#ifdef CONFIG_AMD_MEM_ENCRYPT -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs) -{ - unsigned long sp, *stack; - struct stack_info info; - struct pt_regs *regs_ret; - - /* - * In the SYSCALL entry path the RSP value comes from user-space - don't - * trust it and switch to the current kernel stack - */ - if (ip_within_syscall_gap(regs)) { - sp = this_cpu_read(pcpu_hot.top_of_stack); - goto sync; - } - - /* - * From here on the RSP value is trusted. Now check whether entry - * happened from a safe stack. Not safe are the entry or unknown stacks, - * use the fall-back stack instead in this case. - */ - sp = regs->sp; - stack = (unsigned long *)sp; - - if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY || - info.type > STACK_TYPE_EXCEPTION_LAST) - sp = __this_cpu_ist_top_va(VC2); - -sync: - /* - * Found a safe stack - switch to it as if the entry didn't happen via - * IST stack. The code below only copies pt_regs, the real switch happens - * in assembly code. - */ - sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret); - - regs_ret = (struct pt_regs *)sp; - *regs_ret = *regs; - - return regs_ret; -} -#endif - asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs) { struct pt_regs tmp, *new_stack; diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index 62341cb819ab..7df1301ec343 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -153,7 +153,6 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu) if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) { if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) { cea_map_stack(VC); - cea_map_stack(VC2); } } } From patchwork Mon Apr 3 14:06:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lai Jiangshan X-Patchwork-Id: 13198230 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1CD71C761AF for ; Mon, 3 Apr 2023 14:06:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232844AbjDCOGs (ORCPT ); Mon, 3 Apr 2023 10:06:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44302 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232501AbjDCOGZ (ORCPT ); Mon, 3 Apr 2023 10:06:25 -0400 Received: from mail-pl1-x630.google.com (mail-pl1-x630.google.com [IPv6:2607:f8b0:4864:20::630]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2A2AB1BF62; Mon, 3 Apr 2023 07:05:58 -0700 (PDT) Received: by mail-pl1-x630.google.com with SMTP id f22so23983079plr.0; Mon, 03 Apr 2023 07:05:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1680530752; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=YZbI1OT962diLBE3c0UKswFvsxJYd8uxPutPjIi22EM=; b=gg73Vut+GV0UcPC6llRZsuyu5Jd6lVMMWidXcUhKCQkTFDjmB1I3CR7L7cv6npN6Bz TDGmMpIIwWIeRYqDPMTQmAMQ5eEmAODmuwuBWc2H1LVF9LKq9CrqC9+vIiljdTr8WwPu 
NiZPBDIyWqaGOujmiCVzwSviTwPZeWb7FPcC+Hz3Wwmo3qhQ8wM8gu7RH0bviObqDuiq 0QoPQY1n73HM0oeDY9MlPIoumL7cTzZaWhdL0G5xHUXb1YXridFYBw+mSr7sC0X/01ng mvEP25mCX2fZR5qaR31SbLNAvEzicYlYxwRrf/4i5EoATXWtRrwq5I0q/dwRbf77bsEY hXeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680530752; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=YZbI1OT962diLBE3c0UKswFvsxJYd8uxPutPjIi22EM=; b=4pIhEUqjREUMtdAm2bu8lD0WXesF5mgW+NeA/+0HNUQWOpnF3QWDjP3eRW+qtY+g7M N8x80mNL5ZlAI5nmOAjAolDg9SF0kJ93FHRbd2WE/wCR4RuQvUHtc9RukwfFUqIRf4FX r7inn27+uC+2IGmbW0BWR+g+J0NsJYlvvB0orx+pHYbm0oBxjITkQRr9vOKN3YEgEiW4 sJqPOA1+wszraxvkqN0mkELFEeAMwFzc8ehEbZSVuQEqPHCNAccxT6qJzZwOQ31/Jsay cU75hLxYwgNlGZlS6xcJd+iuqKkHgJ8lQIZNNWAg7ttoyZlMzUOGF49s7IDUqviuKWsQ nmAQ== X-Gm-Message-State: AAQBX9ckv9+bgcUd8HrnRZkZqZIStr5CjjWvyj37sSGBd2QvvH+WBbjo CLKPq5ML/W4mS6TwGDY+RBnzz3J9K+Q= X-Google-Smtp-Source: AKy350Z/M0G8DWcTbUz/MvEQPhAv0qICDkbBQNISY7ldpM63qmWEClM91zpv9zVNXlFsI5nLUkvEfQ== X-Received: by 2002:a17:90a:1de:b0:234:b964:5703 with SMTP id 30-20020a17090a01de00b00234b9645703mr39198970pjd.18.1680530751086; Mon, 03 Apr 2023 07:05:51 -0700 (PDT) Received: from localhost ([198.11.176.14]) by smtp.gmail.com with ESMTPSA id q9-20020a170902788900b001990028c0c9sm6626551pll.68.2023.04.03.07.05.49 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 03 Apr 2023 07:05:50 -0700 (PDT) From: Lai Jiangshan To: linux-kernel@vger.kernel.org Cc: Lai Jiangshan , "H. Peter Anvin" , Andi Kleen , Andrew Cooper , Andy Lutomirski , Asit Mallick , Cfir Cohen , Dan Williams , Dave Hansen , David Kaplan , David Rientjes , Dirk Hohndel , Erdem Aktas , Jan Kiszka , Jiri Slaby , Joerg Roedel , Juergen Gross , Kees Cook , Kirill Shutemov , Kuppuswamy Sathyanarayanan , Linus Torvalds , Mike Stunes , Peter Zijlstra , Raj Ashok , Sean Christopherson , Thomas Gleixner , Tom Lendacky , Tony Luck , kvm@vger.kernel.org, linux-coco@lists.linux.dev, x86@kernel.org, Paolo Bonzini , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" Subject: [RFC PATCH 7/7] x86/entry: Test atomic-IST-entry via KVM Date: Mon, 3 Apr 2023 22:06:05 +0800 Message-Id: <20230403140605.540512-8-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20230403140605.540512-1-jiangshanlai@gmail.com> References: <20230403140605.540512-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Lai Jiangshan Not for inclusion. It is not part of the atomic-IST-entry patch. Test it with a VM with only one vCPU. Lauch the VM with the uncompressed vmlinux. The test injects super exceptions (MCE NMI DB) randomly. The test make the ratio of the occurence of nested exceptions (different vectors) higher. The test reables the nested of the same vector very hurry. When the cpu reaches a commit_ip, all the events are reabled. 
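In other words, the test raises the odds of nesting exceptions of different vectors while a handler is being single-stepped, and it re-enables nesting of the same vector as soon as the guest reaches the corresponding commit_ip. A minimal host-side simulation of that selection-and-re-enable policy is below; it is only a sketch, not code from the patch. The cascade order #MC -> #DB -> NMI, the per-vector blocking and the commit_ip reset mirror ist_emul_test_queue_one() and emul_ist_emul_enable_delivering() in the diff further down, while the plain rand() RNG, the vec_name table and main() are purely illustrative.

/*
 * Standalone userspace simulation (not part of the patch) of the
 * vector-selection logic described above.  A vector stays blocked from
 * the moment it is injected until the guest reaches commit_ip, at which
 * point every vector may nest again.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_NESTING 50

enum { V_MCE, V_DB, V_NMI, NR_VECTORS };	/* names are illustrative */
static const char *const vec_name[NR_VECTORS] = { "#MC", "#DB", "NMI" };

static int depth;
static int blocked[NR_VECTORS];

/* One attempt to queue an IST exception, as the test does per step. */
static void queue_one(void)
{
	int which = rand() % NR_VECTORS;

	if (depth >= MAX_NESTING)
		return;

	/* Prefer #MC, then #DB, then NMI -- same cascade as the patch. */
	for (int v = 0; v < NR_VECTORS; v++) {
		if (which <= v && !blocked[v]) {
			blocked[v] = 1;	/* no same-vector nesting until commit_ip */
			depth++;
			printf("inject %s, depth %d\n", vec_name[v], depth);
			return;
		}
	}
}

/* The guest reached commit_ip: the atomic IST entry is committed, so
 * nesting is re-enabled for every vector. */
static void reached_commit_ip(void)
{
	for (int v = 0; v < NR_VECTORS; v++)
		blocked[v] = 0;
}

int main(void)
{
	for (int step = 0; step < 32; step++) {
		queue_one();
		if (rand() % 2)		/* pretend the guest hit commit_ip */
			reached_commit_ip();
	}
	return 0;
}

In the module the per-vector commit points are commit_asm_exc_nmi, commit_asm_exc_debug and commit_asm_exc_machine_check; the usage block below resolves their addresses with readelf and passes them as module parameters.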
usage: parameters="$parameters exc_nmi_handler=0x$(readelf -W -s vmlinux | awk '/ exc_nmi$/ { print $2}')" parameters="$parameters asm_exc_nmi_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_nmi$/ { print $2}')" parameters="$parameters asm_exc_db_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_debug$/ { print $2}')" parameters="$parameters asm_exc_mce_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_machine_check$/ { print $2}')" parameters="$parameters exc_db_kernel_handler=0x$(readelf -W -s vmlinux | awk '/ exc_debug$/ { print $2}')" parameters="$parameters exc_db_user_handler=0x$(readelf -W -s vmlinux | awk '/ noist_exc_debug$/ { print $2}')" parameters="$parameters exc_mce_kernel_handler=0x$(readelf -W -s vmlinux | awk '/ exc_machine_check$/ { print $2}')" parameters="$parameters exc_mce_user_handler=0x$(readelf -W -s vmlinux | awk '/ noist_exc_machine_check$/ { print $2}')" parameters="$parameters asm_exc_nmi_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_nmi$/ { print $2}')" parameters="$parameters asm_exc_db_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_debug$/ { print $2}')" parameters="$parameters asm_exc_mce_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_machine_check$/ { print $2}')" parameters="$parameters native_irq_return_iret_ip=0x$(readelf -W -s vmlinux | awk '/ native_irq_return_iret$/ { print $2}')" parameters="$parameters ist_emul_test=1" insmod arch/x86/kvm/kvm.ko insmod arch/x86/kvm/kvm-intel.ko $parameters Signed-off-by: Lai Jiangshan --- arch/x86/kvm/vmx/vmx.c | 441 ++++++++++++++++++++++++++++++++++++++++- arch/x86/kvm/x86.c | 10 +- 2 files changed, 447 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index bcac3efcde41..6f0d574a137a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1252,6 +1252,441 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel, } } +#include + +// readelf -W -s vmlinux | awk '/ asm_exc_double_fault$/ { print $2}' +static unsigned long asm_exc_df_handler; +module_param(asm_exc_df_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ asm_exc_nmi$/ { print $2}' +static unsigned long asm_exc_nmi_handler; +module_param(asm_exc_nmi_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ asm_exc_debug$/ { print $2}' +static unsigned long asm_exc_db_handler; +module_param(asm_exc_db_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ asm_exc_machine_check$/ { print $2}' +static unsigned long asm_exc_mce_handler; +module_param(asm_exc_mce_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ exc_nmi$/ { print $2}' +static unsigned long exc_nmi_handler; +module_param(exc_nmi_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ exc_debug$/ { print $2}' +static unsigned long exc_db_kernel_handler; +module_param(exc_db_kernel_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ noist_exc_debug$/ { print $2}' +static unsigned long exc_db_user_handler; +module_param(exc_db_user_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ exc_machine_check$/ { print $2}' +static unsigned long exc_mce_kernel_handler; +module_param(exc_mce_kernel_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ noist_exc_machine_check$/ { print $2}' +static unsigned long exc_mce_user_handler; +module_param(exc_mce_user_handler, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ commit_asm_exc_nmi$/ { print $2}' +static unsigned long asm_exc_nmi_commit_ip; 
+module_param(asm_exc_nmi_commit_ip, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ commit_asm_exc_debug$/ { print $2}' +static unsigned long asm_exc_db_commit_ip; +module_param(asm_exc_db_commit_ip, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ commit_asm_exc_machine_check$/ { print $2}' +static unsigned long asm_exc_mce_commit_ip; +module_param(asm_exc_mce_commit_ip, ulong, S_IRUGO); + +// readelf -W -s vmlinux | awk '/ native_irq_return_iret$/ { print $2}' +static unsigned long native_irq_return_iret_ip; +module_param(native_irq_return_iret_ip, ulong, S_IRUGO); + +#define MAX_NESTING 50 +#define IST_START_RATE 512 /* rate of queuing an IST per vm-exit if no IST queued */ +#define IST_NESTED_RATE 256 /* rate of nesting an IST per instruction during handling IST */ + +static int instruction_count; +static int ist_depth, ist_atomic_depth; +static int ist_nmi_blocking, ist_db_blocking, ist_mce_blocking; +static u8 pending_vector[MAX_NESTING]; +static unsigned long ist_interrupted_rip[MAX_NESTING]; +static unsigned long commit_ip; + +#define TEST_STOPPED 0 +#define TEST_REQUESTED 1 +#define TEST_ONGOING 2 +static int ist_emul_test = TEST_STOPPED; +// set ist_emul_test = 1 (TEST_REQUESTED) when loading the module to enable the test +// in a script which also sets all the above symbol addresses +module_param(ist_emul_test, int, S_IRUGO); + +static bool test_too_early_nesting_bug; +module_param(test_too_early_nesting_bug, bool, S_IRUGO); + +extern int ist_emul_test_singlestep; + +static bool ist_emul_injecting; +static bool ist_emul_returning; + +static void print_pending_vectors(void) +{ + int i; + + pr_cont("current vectors:"); + for (i = 0; i < ist_depth; i++) { + switch (pending_vector[i]) { + case MC_VECTOR: pr_cont(" #MC"); break; + case DB_VECTOR: pr_cont(" #DB"); break; + case 2: pr_cont(" NMI"); break; + default: WARN_ON(1); + } + } + pr_cont(";\n"); +} + +#define DR7_EN(i, type, len) (((type | len) & 0xf) << (DR_CONTROL_SHIFT + i * DR_CONTROL_SIZE)) |\ + (DR_GLOBAL_ENABLE << (i * DR_ENABLE_SIZE)) + +#define DR7_X(i) DR7_EN(i, X86_BREAKPOINT_EXECUTE, X86_BREAKPOINT_LEN_X) + +static void watch_injecting(struct kvm_vcpu *vcpu) +{ + kvm_set_dr(vcpu, 0, asm_exc_nmi_handler); + kvm_set_dr(vcpu, 1, asm_exc_db_handler); + kvm_set_dr(vcpu, 2, asm_exc_mce_handler); + kvm_set_dr(vcpu, 3, asm_exc_df_handler); + kvm_set_dr(vcpu, 6, DR6_RESERVED); + kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1) | DR7_X(2) | DR7_X(3)); +} + +static void finish_watching_injecting(struct kvm_vcpu *vcpu) +{ + kvm_set_dr(vcpu, 0, 0); + kvm_set_dr(vcpu, 1, 0); + kvm_set_dr(vcpu, 2, 0); + kvm_set_dr(vcpu, 3, 0); + kvm_set_dr(vcpu, 6, DR6_RESERVED); + kvm_set_dr(vcpu, 7, 0); +} + +static void watch_handler_done(struct kvm_vcpu *vcpu) +{ + kvm_set_dr(vcpu, 0, ist_interrupted_rip[ist_depth-1]); + kvm_set_dr(vcpu, 1, asm_exc_df_handler); + //kvm_set_dr(vcpu, 2, exc_nmi_handler); + //kvm_set_dr(vcpu, 3, native_irq_return_iret_ip); + kvm_set_dr(vcpu, 6, DR6_RESERVED); + kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1)); + //kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1) | DR7_X(2) | DR7_X(3)); +} + +static void ist_emul_test_start_singlestep(struct kvm_vcpu *vcpu) +{ + ist_emul_test_singlestep = 1; +} + +static void ist_emul_test_stop_singlestep(struct kvm_vcpu *vcpu) +{ + ist_emul_test_singlestep = 0; + watch_handler_done(vcpu); + finish_watching_injecting(vcpu); +} + +static void ist_emul_test_stop(struct kvm_vcpu *vcpu) +{ + ist_emul_test = TEST_STOPPED; + ist_emul_test_stop_singlestep(vcpu); +} + +static bool is_entering_ip(struct 
kvm_vcpu *vcpu) +{ + unsigned long rip = kvm_rip_read(vcpu); + + return rip == asm_exc_mce_handler || rip == asm_exc_db_handler || rip == asm_exc_nmi_handler; +} + +static void emul_ist_bug_in_singlestep(struct kvm_vcpu *vcpu) +{ + unsigned long rip = kvm_rip_read(vcpu); + + if (WARN_ON_ONCE(!ist_depth)) { + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + return; + } + + if (rip == asm_exc_df_handler || instruction_count > 1000) { + ist_emul_test_stop(vcpu); + return; + } +} + +static unsigned long read_one(struct kvm_vcpu *vcpu, unsigned long addr) +{ + unsigned long ret; + struct x86_exception e; + if (kvm_read_guest_virt(vcpu, addr, &ret, sizeof(ret), &e) != X86EMUL_CONTINUE) + return 0xf000f123f456f789; + return ret; +} + +static void emul_ist_kctx_handler_return(struct kvm_vcpu *vcpu) +{ + unsigned long rsp = kvm_rsp_read(vcpu); + unsigned long ret_rip; + struct x86_exception e; + + if (kvm_read_guest_virt(vcpu, rsp, &ret_rip, sizeof(ret_rip), &e) != X86EMUL_CONTINUE) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + + /* it is faked #MC #DB NMI, make the guest handle it as if the handler in kernel context was an NOP */ + kvm_rsp_write(vcpu, rsp+8); + kvm_rip_write(vcpu, ret_rip); +} + +static void emul_ist_kctx_handler_done(struct kvm_vcpu *vcpu) +{ + unsigned long rip = kvm_rip_read(vcpu); + int vector = -1; + + if (rip == exc_nmi_handler) + vector = NMI_VECTOR; + if (rip == exc_db_kernel_handler || rip == exc_db_user_handler) + vector = DB_VECTOR; + if (rip == exc_mce_kernel_handler || rip == exc_mce_user_handler) + vector = MC_VECTOR; + + if (vector > 0) { + if (WARN_ON(vector != pending_vector[ist_depth-1])) + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + pr_info("Enter the handler in kernel context; "); + print_pending_vectors(); + emul_ist_kctx_handler_return(vcpu); + } +} + +static void emul_ist_emul_enable_delivering(struct kvm_vcpu *vcpu) +{ + unsigned long rip = kvm_rip_read(vcpu); + + if (rip == commit_ip) { + ist_atomic_depth = 0; + ist_nmi_blocking = 0; + ist_db_blocking = 0; + ist_mce_blocking = 0; + } +} + +static bool test_nesting_bug(void) +{ + return test_too_early_nesting_bug && ((get_random_u32() % 1024) == 0); +} + +static void __ist_emul_test_queue_one(struct kvm_vcpu *vcpu, int vector) +{ + if (!ist_atomic_depth) + ist_interrupted_rip[ist_depth] = kvm_rip_read(vcpu); + else + ist_interrupted_rip[ist_depth] = commit_ip; + pending_vector[ist_depth] = vector; + ist_depth++; + ist_atomic_depth++; + instruction_count = 0; + pr_info("queue an IST exception, depth:%d, atomic_depth:%d; ", ist_depth, ist_atomic_depth); + print_pending_vectors(); + + // per intel SDM, X86_EFLAGS_TF is cleared after the exception is diliverred. 
+ // use hw breakpoint to force immediate exit + // and add X86_EFLAGS_TF after breakpoint hits + ist_emul_injecting = true; + watch_injecting(vcpu); + pr_info("interrupt ip: %lx return ip: %lx\n", kvm_rip_read(vcpu), ist_interrupted_rip[ist_depth-1]); +} + +static void ist_emul_test_queue_one(struct kvm_vcpu *vcpu) +{ + u32 which = get_random_u32() % 3; + + if (ist_depth >= MAX_NESTING) + return; + + if (which <= 0 && (!ist_mce_blocking || test_nesting_bug())) { + kvm_queue_exception(vcpu, MC_VECTOR); + ist_mce_blocking = 1; + __ist_emul_test_queue_one(vcpu, MC_VECTOR); + commit_ip = asm_exc_mce_commit_ip; + return; + } + + if (which <= 1 && (!ist_db_blocking || test_nesting_bug())) { + kvm_queue_exception_p(vcpu, DB_VECTOR, DR6_BS); + ist_db_blocking = 1; + __ist_emul_test_queue_one(vcpu, DB_VECTOR); + commit_ip = asm_exc_db_commit_ip; + return; + } + + if (which <= 2 && (!ist_nmi_blocking || test_nesting_bug())) { + vmx_set_nmi_mask(vcpu, false); + kvm_inject_nmi(vcpu); + ist_nmi_blocking = 1; + __ist_emul_test_queue_one(vcpu, 2); + commit_ip = asm_exc_nmi_commit_ip; + return; + } +} + +static void __handle_ist_emul_test_singlestep(struct kvm_vcpu *vcpu) +{ + instruction_count++; + //pr_info("singlestep ip: %lx\n", kvm_rip_read(vcpu)); + emul_ist_emul_enable_delivering(vcpu); + emul_ist_kctx_handler_done(vcpu); + emul_ist_bug_in_singlestep(vcpu); + + /* try to nest an IST exception if the previous is not done */ + if (!ist_emul_test_singlestep) + return; + + // no RF, the return rip need to be watched. + // #DB during a repeated string instruction will get RF (sync_regs()) + if (vmx_get_rflags(vcpu) & X86_EFLAGS_RF) + return; + + if (get_random_u32() % IST_NESTED_RATE) + return; + + ist_emul_test_queue_one(vcpu); +} + +static bool handle_ist_emul_test_singlestep(struct kvm_vcpu *vcpu, unsigned long dr6) +{ + unsigned long rip; + + if (!ist_emul_test_singlestep) + return false; + + //kvm_get_dr(vcpu, 6, &dr6); + //pr_info("singlestep rip %lx dr6 %lx\n", kvm_rip_read(vcpu), dr6); + //kvm_set_dr(vcpu, 6, DR6_RESERVED); + if (WARN_ON(!ist_emul_injecting && !ist_emul_returning && is_entering_ip(vcpu))) { + ist_emul_test_stop(vcpu); + return true; + } + if (ist_emul_injecting) { + //pr_info("xxx ip: %lx\n", kvm_rip_read(vcpu)); + if (WARN_ON(!is_entering_ip(vcpu))) { + ist_emul_test_stop(vcpu); + return true; + } + ist_emul_injecting = false; + finish_watching_injecting(vcpu); + if (0) + read_one(vcpu, kvm_rsp_read(vcpu)); + //pr_info("rip in ret %lx\n", read_one(vcpu, kvm_rsp_read(vcpu))); + //watch_handler_done(vcpu); + //if (vmx_get_rflags(vcpu) & X86_EFLAGS_TF) + // pr_info("xxxxxxxxxxxxxxxxxxxxxx\n"); + vmx_set_rflags(vcpu, vmx_get_rflags(vcpu) | X86_EFLAGS_TF); + } + if (ist_emul_returning) { + rip = kvm_rip_read(vcpu); + if (rip != ist_interrupted_rip[ist_depth-1]) { + pr_info("rip %lx expect rip %lx\n", rip, ist_interrupted_rip[ist_depth-1]); + WARN_ON(1); + ist_emul_test_stop(vcpu); + return true; + } + ist_depth--; + pr_info("IST exception returned, depth:%d; ", ist_depth); + print_pending_vectors(); + + // print instruction_count for adjusting IST_NESTED_RATE + // to contrl the average ist_depth for the the test. 
+ //pr_info("instruction_count %d\n", instruction_count); + instruction_count = 0; + ist_emul_returning = false; + + if (!ist_depth) { + ist_emul_test_stop_singlestep(vcpu); + return true; + } + } + + + __handle_ist_emul_test_singlestep(vcpu); + + rip = kvm_rip_read(vcpu); + if (!ist_emul_injecting && rip == native_irq_return_iret_ip) { + pr_info("IST exception is going to return\n"); + //watch_handler_done(vcpu); + ist_emul_returning = true; + } + return true; +} + +static void handle_ist_emul_test(struct kvm_vcpu *vcpu) +{ + //if (ist_emul_test_singlestep) + // pr_info("vmexit ip: %lx\n", kvm_rip_read(vcpu)); + + if (ist_emul_test == TEST_STOPPED) + return; + + if (ist_emul_test == TEST_REQUESTED) { + /* the test can't only be started until the guest's booting process is done. */ + if (vmx_get_cpl(vcpu)) + ist_emul_test = TEST_ONGOING; + else + return; + } + + /* Don't start the test if the test is already started or someone + * else is using the X86_EFLAGS_TF */ + // no RF, the return rip need to be watched. + if (vmx_get_rflags(vcpu) & (X86_EFLAGS_TF | X86_EFLAGS_RF)) + return; + + if (ist_emul_test_singlestep) { + //kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + return; + } + + if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & + (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)) + return; + + if (vcpu->arch.nmi_injected || vcpu->arch.interrupt.injected || + vcpu->arch.exception.pending || vcpu->arch.exception.injected) + return; + + pr_info_once("%lx %lx %lx %lx %lx %lx\n", vcpu->arch.eff_db[0], vcpu->arch.eff_db[1], + vcpu->arch.eff_db[2], vcpu->arch.eff_db[3], vcpu->arch.dr6, vcpu->arch.dr7); + if (vcpu->arch.eff_db[0] || vcpu->arch.eff_db[1] || + vcpu->arch.eff_db[2] || vcpu->arch.eff_db[3] || + ((vcpu->arch.dr6 | DR6_RESERVED) != DR6_RESERVED) || + ((vcpu->arch.dr7 | DR7_FIXED_1) != DR7_FIXED_1)) + return; + + if (WARN_ON_ONCE(ist_depth)) { + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + return; + } + + /* XXX: don't start the testing when the guest is handling real #MC #DB NMI. + * But how to detect it? Just skip the detecting since is a test. */ + + if (get_random_u32() % IST_START_RATE) + return; + + ist_emul_test_start_singlestep(vcpu); + ist_emul_test_queue_one(vcpu); +} + void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -5258,6 +5693,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu) switch (ex_no) { case DB_VECTOR: dr6 = vmx_get_exit_qual(vcpu); + if (handle_ist_emul_test_singlestep(vcpu, dr6)) + return 1; if (!(vcpu->guest_debug & (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))) { /* @@ -6579,6 +7016,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) { int ret = __vmx_handle_exit(vcpu, exit_fastpath); + if (ret > 0) + handle_ist_emul_test(vcpu); /* * Exit to user space when bus lock detected to inform that there is * a bus lock in guest. @@ -7290,7 +7729,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) * vmentry fails as it then expects bit 14 (BS) in pending debug * exceptions being set, but that's not correct for the guest debugging * case. 
*/ - if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) + if (ist_emul_test_singlestep || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)) vmx_set_interrupt_shadow(vcpu, 0); kvm_load_guest_xsave_state(vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7713420abab0..205a4d989046 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -814,6 +814,7 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu) atomic_inc(&vcpu->arch.nmi_queued); kvm_make_request(KVM_REQ_NMI, vcpu); } +EXPORT_SYMBOL_GPL(kvm_inject_nmi); void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code) { @@ -12810,12 +12811,15 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip) } EXPORT_SYMBOL_GPL(kvm_is_linear_rip); +int ist_emul_test_singlestep; +EXPORT_SYMBOL_GPL(ist_emul_test_singlestep); + unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu) { unsigned long rflags; rflags = static_call(kvm_x86_get_rflags)(vcpu); - if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) + if (ist_emul_test_singlestep || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)) rflags &= ~X86_EFLAGS_TF; return rflags; } @@ -12823,8 +12827,8 @@ EXPORT_SYMBOL_GPL(kvm_get_rflags); static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) { - if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP && - kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip)) + if (ist_emul_test_singlestep || ((vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) && + kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))) rflags |= X86_EFLAGS_TF; static_call(kvm_x86_set_rflags)(vcpu, rflags); }
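The last two hunks above reuse the KVM_GUESTDBG_SINGLESTEP plumbing: whenever ist_emul_test_singlestep is set, __kvm_set_rflags() forces TF into the RFLAGS value the vCPU actually runs with, and kvm_get_rflags() masks it back out so the rest of KVM does not observe the injected flag. A standalone sketch of that read/write pattern is below; single_step_forced stands in for ist_emul_test_singlestep, and everything else is illustrative rather than KVM code.

/* Userspace illustration of the "force TF on write, hide TF on read"
 * pattern used above; not part of the patch. */
#include <stdbool.h>
#include <stdio.h>

#define X86_EFLAGS_TF	0x100UL		/* trap flag, bit 8 */

static bool single_step_forced;		/* stand-in for ist_emul_test_singlestep */
static unsigned long hw_rflags;		/* what the vCPU actually runs with */

static unsigned long get_rflags(void)
{
	unsigned long rflags = hw_rflags;

	if (single_step_forced)
		rflags &= ~X86_EFLAGS_TF;	/* hide the injected TF from callers */
	return rflags;
}

static void set_rflags(unsigned long rflags)
{
	if (single_step_forced)
		rflags |= X86_EFLAGS_TF;	/* keep trapping after every instruction */
	hw_rflags = rflags;
}

int main(void)
{
	single_step_forced = true;
	set_rflags(0x2);			/* caller never asked for TF */
	printf("hw=%#lx, reported=%#lx\n", hw_rflags, get_rflags());
	return 0;
}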