From patchwork Mon Jun 26 14:12:47 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Kai"
X-Patchwork-Id: 13292987
From: Kai Huang
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com,
	kirill.shutemov@linux.intel.com, tony.luck@intel.com,
	peterz@infradead.org, tglx@linutronix.de, bp@alien8.de,
	mingo@redhat.com, hpa@zytor.com, seanjc@google.com,
	pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com,
	rafael.j.wysocki@intel.com, ashok.raj@intel.com,
	reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com,
	isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com,
	bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com,
	kai.huang@intel.com
Subject: [PATCH v12 17/22] x86/kexec: Flush cache of TDX private memory
Date: Tue, 27 Jun 2023 02:12:47 +1200
Message-Id: <92fae16d43128e7196f04db5ed71935f3e5ea32e.1687784645.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.40.1

There are two problems with using kexec() to boot to a new kernel when
the old kernel has enabled TDX:

 1) Some memory pages are still TDX private pages;
 2) There might be dirty cachelines associated with TDX private pages.

The first problem doesn't matter on platforms without the "partial
write machine check" erratum. KeyID 0 has no integrity check. If the
new kernel wants to use any non-zero KeyID, it needs to convert the
memory to that KeyID, and such conversion works from any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

 1) Stop accessing TDX private memory mappings:
    a. Stop making TDX module SEAMCALLs (TDX global KeyID);
    b. Stop TDX guests from running (per-guest TDX KeyID).
 2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME
support. And in this way 1) happens for free as there's no TDX activity
between wbinvd() and the native_halt().

Flushing cache in stop_this_cpu() only flushes cache on remote cpus. On
the rebooting cpu which does kexec(), unlike SME which does the cache
flush in relocate_kernel(), flush the cache right after stopping remote
cpus in machine_shutdown().

There are two reasons to do so:

 1) For TDX there's no need to defer cache flush to relocate_kernel()
    because all TDX activities have been stopped.
 2) On the platforms with the above erratum the kernel must convert all
    TDX private pages back to normal before booting to the new kernel
    in kexec(), and flushing cache early allows the kernel to convert
    memory early rather than having to muck with the relocate_kernel()
    assembly.

Theoretically, cache flush is only needed when the TDX module has been
initialized. However initializing the TDX module is done on demand at
runtime, and it takes a mutex to read the module status. Instead, just
check whether TDX is enabled by the BIOS to decide whether to flush the
cache.

Signed-off-by: Kai Huang
Reviewed-by: Isaku Yamahata
Reviewed-by: Kirill A. Shutemov
---

v11 -> v12:
 - Changed comment/changelog to say kernel doesn't try to handle fast
   warm reset but depends on BIOS to enable workaround (Kirill)
 - Added Kirill's tag

v10 -> v11:
 - Fixed a bug where the cache of the rebooting cpu wasn't flushed for
   TDX private memory.
 - Updated changelog accordingly.

v9 -> v10:
 - No change.

v8 -> v9:
 - Various changelog enhancements and fixes (Dave).
 - Improved comment (Dave).

v7 -> v8:
 - Changelog:
   - Removed the "leave TDX module open" part because the shutdown
     patch has been removed.

v6 -> v7:
 - Improved changelog to explain why TDX private pages are not
   converted back to normal.

---
 arch/x86/kernel/process.c |  7 ++++++-
 arch/x86/kernel/reboot.c  | 15 +++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dac41a0072ea..0ce66deb9bc8 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -780,8 +780,13 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * The TDX module or guests might have left dirty cachelines
+	 * behind. Flush them to avoid corruption from later writeback.
+	 * Note that this flushes on all systems where TDX is possible,
+	 * but does not actually check that TDX was in use.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 3adbe97015c1..ae7480a213a6 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 
 /*
  * Power off function, if any
@@ -695,6 +696,20 @@ void native_machine_shutdown(void)
 	local_irq_disable();
 	stop_other_cpus();
 #endif
+	/*
+	 * stop_other_cpus() has flushed all dirty cachelines of TDX
+	 * private memory on remote cpus. Unlike SME, which does the
+	 * cache flush on _this_ cpu in the relocate_kernel(), flush
+	 * the cache for _this_ cpu here. This is because on the
+	 * platforms with "partial write machine check" erratum the
+	 * kernel needs to convert all TDX private pages back to normal
+	 * before booting to the new kernel in kexec(), and the cache
+	 * flush must be done before that. If the kernel took SME's way,
+	 * it would have to muck with the relocate_kernel() assembly to
+	 * do memory conversion.
+	 */
+	if (platform_tdx_enabled())
+		native_wbinvd();
 
 	lapic_shutdown();
 	restore_boot_irq_mode();