From patchwork Mon Nov 21 00:26:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13050220 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7AD79C433FE for ; Mon, 21 Nov 2022 00:28:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 16D708E0003; Sun, 20 Nov 2022 19:28:19 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 11E0C8E0001; Sun, 20 Nov 2022 19:28:19 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F27038E0003; Sun, 20 Nov 2022 19:28:18 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id E37198E0001 for ; Sun, 20 Nov 2022 19:28:18 -0500 (EST) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id BB7ACC06EA for ; Mon, 21 Nov 2022 00:28:18 +0000 (UTC) X-FDA: 80155562676.02.853BE0E Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by imf22.hostedemail.com (Postfix) with ESMTP id 26114C0008 for ; Mon, 21 Nov 2022 00:28:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1668990498; x=1700526498; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=B4X5vcXsxt6uAK36/XCY9Qp+zNxFIBT0eQmTkgHAVxg=; b=WCzjLYoTxFwDyTk6Yd/9PolPqZgzMfIo2sI83f4uWxqHPMX6Wo8ks36p QromQ+ZT/rghN8/OoLRfH6fRigxCZ09V2NWWCl4j5j+htPK7LD1n3ijxh 1VOeYDBWd9vUcRRFMIQW1fpeiwxGT0rfoRZirW0wrAVA2l8L48nhA8lCt 9jYqtXrYM547StPsy5WBjX3Bbp7HiFzWeU3GR6swCbBTHpGIVG4LfnY7g sg/pP+gWoTLjTvqQdtoa7AAcey3SptLA/kVUgxU2ZXn1zOEAl4PbDUqlH BAurcnJPaqd5vg5s6uuOhU1txVJVBT0rVoznap1sxQh1IIjYccbnQSJFY w==; X-IronPort-AV: E=McAfee;i="6500,9779,10537"; a="296803752" X-IronPort-AV: E=Sophos;i="5.96,180,1665471600"; d="scan'208";a="296803752" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Nov 2022 16:28:17 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10537"; a="729825557" X-IronPort-AV: E=Sophos;i="5.96,180,1665471600"; d="scan'208";a="729825557" Received: from tomnavar-mobl.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.209.176.15]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Nov 2022 16:28:13 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, peterz@infradead.org, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v7 19/20] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Date: Mon, 21 Nov 2022 13:26:41 +1300 Message-Id: X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1668990498; a=rsa-sha256; cv=none; b=DIIrOrpD4td70SU2AmVRdiblcqMEB/OEAO/I3icyqM89u6UQmZdPmcatSxtYnlYlQ1Gjon 8uyxP18QLzBwQ+h/9zGQcOhqsbeG1KI8Ox+mNGybXI30Cv/7yRWG6rNalRdZp81GClSm9B mJkMM+bq6p9UqY+dYSLQeKy3fk32SnA= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=WCzjLYoT; spf=pass (imf22.hostedemail.com: domain of kai.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=kai.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1668990498; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=0HsaI+F+Q6SMDl5C9timeO1Ls7uwIsud4GJwh7yr45s=; b=sH9LsAGwxieWTUhRadk8sPXRHBXSUOCoLV5Oesdk6tTEBy8R5zfPyU3tKoE6KpuKnlBTCB yJbnd7Vkr1PRBhwV/kqGrFmC63jhW0Jq2Smhr2wEAUYeSkb613pYcbNrADxgqBNVy8aJHJ 8K7ySXi1yy4w6LHrkgODM8AdoANmstY= X-Rspam-User: X-Stat-Signature: dohy4aozdh78inzrz5jaodsjsfiy4s7r X-Rspamd-Queue-Id: 26114C0008 Authentication-Results: imf22.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=WCzjLYoT; spf=pass (imf22.hostedemail.com: domain of kai.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=kai.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam07 X-HE-Tag: 1668990497-174395 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: There are two problems in terms of using kexec() to boot to a new kernel when the old kernel has enabled TDX: 1) Part of the memory pages are still TDX private pages (i.e. metadata used by the TDX module, and any TDX guest memory if kexec() happens when there's any TDX guest alive). 2) There might be dirty cachelines associated with TDX private pages. Because the hardware doesn't guarantee cache coherency among different KeyIDs, the old kernel needs to flush cache (of those TDX private pages) before booting to the new kernel. Also, reading TDX private page using any shared non-TDX KeyID with integrity-check enabled can trigger #MC. Therefore ideally, the kernel should convert all TDX private pages back to normal before booting to the new kernel. However, this implementation doesn't convert TDX private pages back to normal in kexec() because of below considerations: 1) The kernel doesn't have existing infrastructure to track which pages are TDX private pages. 2) The number of TDX private pages can be large, and converting all of them (cache flush + using MOVDIR64B to clear the page) in kexec() can be time consuming. 3) The new kernel will almost only use KeyID 0 to access memory. KeyID 0 doesn't support integrity-check, so it's OK. 4) The kernel doesn't (and may never) support MKTME. If any 3rd party kernel ever supports MKTME, it should do MOVDIR64B to clear the page with the new MKTME KeyID (just like TDX does) before using it. Therefore, this implementation just flushes cache to make sure there are no stale dirty cachelines associated with any TDX private KeyIDs before booting to the new kernel, otherwise they may silently corrupt the new kernel. Following SME support, use wbinvd() to flush cache in stop_this_cpu(). Theoretically, cache flush is only needed when the TDX module has been initialized. However initializing the TDX module is done on demand at runtime, and it takes a mutex to read the module status. Just check whether TDX is enabled by BIOS instead to flush cache. Also, the current TDX module doesn't play nicely with kexec(). The TDX module can only be initialized once during its lifetime, and there is no ABI to reset the module to give a new clean slate to the new kernel. Therefore ideally, if the TDX module is ever initialized, it's better to shut it down. The new kernel won't be able to use TDX anyway (as it needs to go through the TDX module initialization process which will fail immediately at the first step). However, shutting down the TDX module requires all CPUs being in VMX operation, but there's no such guarantee as kexec() can happen at any time (i.e. when KVM is not even loaded). So just do nothing but leave leave the TDX module open. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v6 -> v7: - Improved changelog to explain why don't convert TDX private pages back to normal. --- arch/x86/kernel/process.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index c21b7347a26d..0cc84977dc62 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy) * * Test the CPUID bit directly because the machine might've cleared * X86_FEATURE_SME due to cmdline options. + * + * Similar to SME, if the TDX module is ever initialized, the + * cachelines associated with any TDX private KeyID must be flushed + * before transiting to the new kernel. The TDX module is initialized + * on demand, and it takes the mutex to read its status. Just check + * whether TDX is enabled by BIOS instead to flush cache. */ - if (cpuid_eax(0x8000001f) & BIT(0)) + if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled()) native_wbinvd(); for (;;) { /*