From patchwork Mon Feb 13 11:59:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13138307 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0A2BEC636D4 for ; Mon, 13 Feb 2023 12:01:36 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A24296B007B; Mon, 13 Feb 2023 07:01:35 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9AD366B007E; Mon, 13 Feb 2023 07:01:35 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7D91A6B0080; Mon, 13 Feb 2023 07:01:35 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 6C7D96B007B for ; Mon, 13 Feb 2023 07:01:35 -0500 (EST) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id E275541765 for ; Mon, 13 Feb 2023 12:01:34 +0000 (UTC) X-FDA: 80462128908.01.C6B13D0 Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by imf07.hostedemail.com (Postfix) with ESMTP id 198434002D for ; Mon, 13 Feb 2023 12:01:23 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=RrA1PdrG; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf07.hostedemail.com: domain of kai.huang@intel.com designates 192.55.52.88 as permitted sender) smtp.mailfrom=kai.huang@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1676289692; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=MbyKg8euqH1Lxnknf+Yf6rHuxUGctzWjIwEvS3YCu90=; b=wydgfn0HRx9RGCTSjbo1UcQV6Vnmp+uQk8bzD1vWV353tEkIh63hCQde4Ahr+0r/i+SieH R0cw1qQ9rcuagXSbbFVeD0G1JcflyMWtt1/TMvVoldLqgapFRsJ9GYUPdd0jnHhX1tPFz1 ohkHDIw5ffX10nBVFdeMgqQdaajG8xE= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=RrA1PdrG; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf07.hostedemail.com: domain of kai.huang@intel.com designates 192.55.52.88 as permitted sender) smtp.mailfrom=kai.huang@intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1676289692; a=rsa-sha256; cv=none; b=wTJxtF/G/FhKBpMrYcQmpCCei96E3wlFyC+JsqtnxT4mYVALkbMEwN1qrHvoHNedDpTf7X zLMzDKczCYjlxv5w/ZYqHeU6He1E/wCwvbInzA+hm/W/KJqJsvh6McfOvWNbhZH2/31Duv svrh70Ilvh+hiizDZLV3Cy3gMrc3cCE= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1676289684; x=1707825684; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HWyQYVqfDv21O43Gac9/BTJItIOh2vDGByh2XxGy4+k=; b=RrA1PdrGDaaqbGAjUMNXeQPgCX6tBu14C4zaWKtB7N3z59QOC9lfk+9Q ynHZCiWzkSXVQcao/7zJ1txwo1DU2JuWo2c3F190fl4pOwYCsdzI2rm3r Gyhn+sxt8/mEZMyrpDR1d18sfZho3e3DyEstlKWRqsGYuI0hnRKUWPvTg aFb0xUHvnS2enAxoFshGY8yLK/arKoWU+7Qavu/C7I5Qe6zljaVd1PVoi 8EcsPR1iBm+cIs5qp/q0JRdnyMy5B/y9PEhWyfFxoUx71vpzBmmt2ZWlJ upQoamXx6DV0+M689R+jez6I7DrB6+bH6pBYgGvgwC7T98aExTG70ARK2 Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10619"; a="358283323" X-IronPort-AV: E=Sophos;i="5.97,293,1669104000"; d="scan'208";a="358283323" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Feb 2023 04:01:00 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10619"; a="701243283" X-IronPort-AV: E=Sophos;i="5.97,293,1669104000"; d="scan'208";a="701243283" Received: from wonger-mobl.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.209.188.34]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Feb 2023 04:00:54 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, david@redhat.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v9 09/18] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Date: Tue, 14 Feb 2023 00:59:16 +1300 Message-Id: X-Mailer: git-send-email 2.39.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 198434002D X-Stat-Signature: b5m71mrqmntdiwck91xogtn7hnmk7dnr X-HE-Tag: 1676289683-244800 X-HE-Meta: U2FsdGVkX18WhGEsyk2e6GS6Ph+vgP+/xyj6MSwXH5dPVupqP3omSMsRDyaR9RMTK2ywubDtuBXufo0PSjBMGh7rIAwQe38xOMp3YZnaWjT/P37hWpU+LYBkESHdezQC5uMDiMQ2Ce+fHYykEgGuB1ttj6KNBY+AkFo3ltixgt7d+MZHy8gn/4uejMJgl8ZMP4SZf5XzyKaYj6nr/Hf9LOyQ6cxWbIBrPPaUVK2TpVHxacQsaWy2HCpW55iS32UpG8Y+zv/jwwcSYD+Ov3Dr7DeWEPTHzf9H992bcGbrpJuV82Lvo4Nb95UV4IsvOMbgIRv09DCnmQWW45cTKO8dAMu4d1f2QX6GHQUH2AoLE8fZRmXtW6sj7PuqTEYXVu34YHtvPS1qYEty6Qqx3ScTn4yOVBBYjXrRlXJCb/ocdje5GWY8cwgorDmrhzBtQnzw2ITEFG9nBoxl2yOyFGQVbZpr8cXjyxx0YMUHIZ2FUinudMAJWzIN3hV3lEc6Ssy6i0JQxPiIt1vWzezNSqHdFmQFRpRp6AP8TsAxgdQ3aXiHT0yFAg5AwiCd6a4oXiEjdj10YV2/2wk7lcDsLcs5wPHZ4S0FRnMChwfXFPusGOf2KS3ldNWrRBqed3Xjp/Z9rAmoRdUp9MMuYD1EJ2jd5w49hEHlfq3jB4Ba+TAkVzA+J3ppi+KG7jiTrXTuzsEagoQw4HtftscxKWd0QOcQrtALwadfSubyuHXglYOaiRgPsorf9nU9VU7wTP7VNUr9bN8yIj+9gCecTGWB5b6956DPwUlY0LwNZdKQKk24biw0uA5KZQqayRnhEl3d20uW9K5GC1wI2Mw2m05Osm3e3w36RZEViIlG5/kwMMj4NH5ekQpTC58Gb8rAkOu9dn5Y6ooLG/LeEVtjMzuxU2vE56yIinfNO8XvMY008qW/Q+wNe6XQoLnvmUfOJ654RgZS9dFmTFHc18zvaHmx6yr l/uWyJZH lMjiZOvOhoRrHNVWvDtLWZC+vXp0aBzrD5IpL+Nll/qVJXC+Rqp3NfCEeTw5rlFVYpjXIueG8jjlXNVMbf/GAppCCc9WVqMxBdZMRSJri318u7jhS2RObTQ5awWdUa2xJzg46oFCU+cHwzeo7F1w4FMcLozKocXXQthCMJjHrkBkuEe0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: As a step of initializing the TDX module, the kernel needs to tell the TDX module which memory regions can be used by the TDX module as TDX guest memory. TDX reports a list of "Convertible Memory Region" (CMR) to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. To keep things simple, assume that all TDX-protected memory will come from the page allocator. Make sure all pages in the page allocator *are* TDX-usable memory. As TDX-usable memory is a fixed configuration, take a snapshot of the memory configuration from memblocks at the time of module initialization (memblocks are modified on memory hotplug). This snapshot is used to enable TDX support for *this* memory configuration only. Use a memory hotplug notifier to ensure that no other RAM can be added outside of this configuration. This approach requires all memblock memory regions at the time of module initialization to be TDX convertible memory to work, otherwise module initialization will fail in a later SEAMCALL when passing those regions to the module. This approach works when all boot-time "system RAM" are TDX convertible memory, and no non-TDX-convertible memory is hot-added to the core-mm before module initialization. For instance, on the first generation of TDX machines, both CXL memory and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add any CXL memory or NVDIMM to the core-mm before module initialization will result in module fail to initialize. The SEAMCALL error code will be available in the dmesg to help user to understand the failure. Signed-off-by: Kai Huang Reviewed-by: "Huang, Ying" --- v8 -> v9: - Replace "The initial support ..." with timeless sentence in both changelog and comments(Dave). - Fix run-on sentence in changelog, and senstence to explain why to stash off memblock (Dave). - Tried to improve why to choose this approach and how it work in changelog based on Dave's suggestion. - Many other comments enhancement (Dave). v7 -> v8: - Trimed down changelog (Dave). - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series (Ying). - Moved memory hotplug handling from add_arch_memory() to memory_notifier (Dan/David). - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave). - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave). - Removed pfn_covered_by_cmr() check as no code to trim CMRs now. - Improve the comment around first 1MB (Dave). - Added a comment around reserve_real_mode() to point out TDX code relies on first 1MB being reserved (Ying). - Added comment to explain why the new online memory range cannot cross multiple TDX memory blocks (Dave). - Improved other comments (Dave). --- arch/x86/Kconfig | 1 + arch/x86/kernel/setup.c | 2 + arch/x86/virt/vmx/tdx/tdx.c | 159 +++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 6 ++ 4 files changed, 167 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 6dd5d5586099..f23bc540778a 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST depends on X86_64 depends on KVM_INTEL depends on X86_X2APIC + select ARCH_KEEP_MEMBLOCK help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 88188549647c..a8a119a9b48c 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1165,6 +1165,8 @@ void __init setup_arch(char **cmdline_p) * * Moreover, on machines with SandyBridge graphics or in setups that use * crashkernel the entire 1M is reserved anyway. + * + * Note the host kernel TDX also requires the first 1MB being reserved. */ x86_platform.realmode_reserve(); diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index ae8e59294b46..5101b636a9b0 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -15,6 +15,13 @@ #include #include #include +#include +#include +#include +#include +#include +#include +#include #include #include #include @@ -33,6 +40,9 @@ static DEFINE_MUTEX(tdx_module_lock); static cpumask_t __cpu_tdx_mask; static cpumask_t *cpu_tdx_mask = &__cpu_tdx_mask; +/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ +static LIST_HEAD(tdx_memlist); + /* * Use tdx_global_keyid to indicate that TDX is uninitialized. * This is used in TDX initialization error paths to take it from @@ -71,6 +81,51 @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start, return 0; } +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn) +{ + struct tdx_memblock *tmb; + + /* Empty list means TDX isn't enabled. */ + if (list_empty(&tdx_memlist)) + return true; + + /* + * This check assumes that the start_pfn<->end_pfn range does not + * cross multiple @tdx_memlist entries. A single memory online + * event across multiple memblocks (from which @tdx_memlist + * entries are derived at the time of module initialization) is + * not possible. This is because memory offline/online is done + * on granularity of 'struct memory_block', and the hotpluggable + * memory region (one memblock) must be multiple of memory_block. + */ + list_for_each_entry(tmb, &tdx_memlist, list) { + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) + return true; + } + return false; +} + +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action, + void *v) +{ + struct memory_notify *mn = v; + + if (action != MEM_GOING_ONLINE) + return NOTIFY_OK; + + /* + * The TDX memory configuration is static and can not be + * changed. Reject onlining any memory which is outside of + * the static configuration whether it supports TDX or not. + */ + return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ? + NOTIFY_OK : NOTIFY_BAD; +} + +static struct notifier_block tdx_memory_nb = { + .notifier_call = tdx_memory_notifier, +}; + static int __init tdx_init(void) { u32 tdx_keyid_start, nr_tdx_keyids; @@ -101,6 +156,13 @@ static int __init tdx_init(void) goto no_tdx; } + err = register_memory_notifier(&tdx_memory_nb); + if (err) { + pr_info("initialization failed: register_memory_notifier() failed (%d)\n", + err); + goto no_tdx; + } + tdx_guest_keyid_start = tdx_keyid_start; tdx_nr_guest_keyids = nr_tdx_keyids; @@ -288,6 +350,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, return 0; } +/* + * Add a memory region as a TDX memory block. The caller must make sure + * all memory regions are added in address ascending order and don't + * overlap. + */ +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, + unsigned long end_pfn) +{ + struct tdx_memblock *tmb; + + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); + if (!tmb) + return -ENOMEM; + + INIT_LIST_HEAD(&tmb->list); + tmb->start_pfn = start_pfn; + tmb->end_pfn = end_pfn; + + /* @tmb_list is protected by mem_hotplug_lock */ + list_add_tail(&tmb->list, tmb_list); + return 0; +} + +static void free_tdx_memlist(struct list_head *tmb_list) +{ + /* @tmb_list is protected by mem_hotplug_lock */ + while (!list_empty(tmb_list)) { + struct tdx_memblock *tmb = list_first_entry(tmb_list, + struct tdx_memblock, list); + + list_del(&tmb->list); + kfree(tmb); + } +} + +/* + * Ensure that all memblock memory regions are convertible to TDX + * memory. Once this has been established, stash the memblock + * ranges off in a secondary structure because memblock is modified + * in memory hotplug while TDX memory regions are fixed. + */ +static int build_tdx_memlist(struct list_head *tmb_list) +{ + unsigned long start_pfn, end_pfn; + int i, ret; + + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { + /* + * The first 1MB is not reported as TDX convertible memory. + * Although the first 1MB is always reserved and won't end up + * to the page allocator, it is still in memblock's memory + * regions. Skip them manually to exclude them as TDX memory. + */ + start_pfn = max(start_pfn, PHYS_PFN(SZ_1M)); + if (start_pfn >= end_pfn) + continue; + + /* + * Add the memory regions as TDX memory. The regions in + * memblock has already guaranteed they are in address + * ascending order and don't overlap. + */ + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); + if (ret) + goto err; + } + + return 0; +err: + free_tdx_memlist(tmb_list); + return ret; +} + static int init_tdx_module(void) { static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo, @@ -326,10 +461,25 @@ static int init_tdx_module(void) if (ret) goto out; + /* + * To keep things simple, assume that all TDX-protected memory + * will come from the page allocator. Make sure all pages in the + * page allocator are TDX-usable memory. + * + * Build the list of "TDX-usable" memory regions which cover all + * pages in the page allocator to guarantee that. Do it while + * holding mem_hotplug_lock read-lock as the memory hotplug code + * path reads the @tdx_memlist to reject any new memory. + */ + get_online_mems(); + + ret = build_tdx_memlist(&tdx_memlist); + if (ret) + goto out; + /* * TODO: * - * - Build the list of TDX-usable memory regions. * - Construct a list of "TD Memory Regions" (TDMRs) to cover * all TDX-usable memory regions. * - Configure the TDMRs and the global KeyID to the TDX module. @@ -340,6 +490,12 @@ static int init_tdx_module(void) */ ret = -EINVAL; out: + /* + * @tdx_memlist is written here and read at memory hotplug time. + * Lock out memory hotplug code while building it. + */ + put_online_mems(); + /* * Clear @cpu_tdx_mask if module initialization fails before * CPU hotplug is re-enabled. tdx_cpu_online() uses it to check @@ -382,6 +538,7 @@ static void disable_tdx_module(void) * init_tdx_module(). Remove this comment after * all steps are done. */ + free_tdx_memlist(&tdx_memlist); cpumask_clear(cpu_tdx_mask); } diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index e32d9920b3a7..edb1d697347f 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -112,6 +112,12 @@ enum tdx_module_status_t { TDX_MODULE_ERROR }; +struct tdx_memblock { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; +}; + struct tdx_module_output; u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, struct tdx_module_output *out);