[v12,12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred to as the
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR; they can be anywhere in CMRs.  If a PAMT overlaps
with any TDMR, the overlapping part must be reported as a reserved area
in that particular TDMR.
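
As a rough illustration of the overlap rule above, the sketch below
computes which part of a PAMT falls inside a TDMR.  The struct and
function names are hypothetical and not taken from the patch, and
describing the result as an offset from the TDMR base plus a size is an
assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of the part of a PAMT that lies inside a TDMR. */
struct overlap {
	uint64_t offset;	/* start of the overlap, relative to the TDMR base */
	uint64_t size;		/* length of the overlap in bytes; 0 means none */
};

/*
 * Intersect the half-open ranges [tdmr_base, tdmr_base + tdmr_size) and
 * [pamt_base, pamt_base + pamt_size).
 */
static struct overlap pamt_overlap(uint64_t tdmr_base, uint64_t tdmr_size,
				   uint64_t pamt_base, uint64_t pamt_size)
{
	struct overlap o = { 0, 0 };
	uint64_t tdmr_end, pamt_end, start, end;

	/* This sketch does not handle ranges that wrap around 2^64. */
	assert(tdmr_base + tdmr_size >= tdmr_base);
	assert(pamt_base + pamt_size >= pamt_base);

	tdmr_end = tdmr_base + tdmr_size;
	pamt_end = pamt_base + pamt_size;
	start = pamt_base > tdmr_base ? pamt_base : tdmr_base;
	end = pamt_end < tdmr_end ? pamt_end : tdmr_end;

	if (start < end) {
		o.offset = start - tdmr_base;
		o.size = end - start;
	}
	return o;
}
```

A PAMT that lies entirely outside the TDMR yields a size of 0, meaning
it would consume no reserved area in that TDMR.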

Use alloc_contig_pages() since a PAMT must be a physically contiguous
area and can be large (~1/256th of the size of the given TDMR).  The
downside is that alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated early, but the only real fix is to add a boot
option to allocate or reserve PAMTs during kernel boot.

This is imperfect but will be improved later.
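
The "~1/256th" figure above can be checked with a little arithmetic.
The sketch below is a standalone userspace illustration, not the kernel
code from the patch.  It assumes a 16-byte PAMT entry for every page
size and rounds each PAMT up to 4K; both are assumptions made for this
example (the real entry sizes are reported by the TDX module).  The
helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K	4096ULL

/* Page-size indexes for the three levels: 4K, 2M and 1G. */
enum { PS_4K, PS_2M, PS_1G, PS_NUM };

/* Round x up to a multiple of the power-of-two alignment a. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes needed for the PAMT that tracks pages of size 'ps' in a TDMR. */
static uint64_t pamt_size(uint64_t tdmr_size, int ps, uint64_t entry_size)
{
	unsigned int shift;

	assert(ps >= 0 && ps < PS_NUM);

	/* 4K pages: shift 12, 2M pages: shift 21, 1G pages: shift 30. */
	shift = 12 + 9 * ps;

	/* One entry per page of this size, rounded up to 4K (assumed). */
	return align_up((tdmr_size >> shift) * entry_size, SZ_4K);
}

/* Bytes needed for all three PAMTs of a TDMR. */
static uint64_t pamt_total_size(uint64_t tdmr_size, uint64_t entry_size)
{
	uint64_t total = 0;
	int ps;

	for (ps = 0; ps < PS_NUM; ps++)
		total += pamt_size(tdmr_size, ps, entry_size);
	return total;
}
```

For a 1G TDMR under these assumptions, the 4K PAMT takes 4 MiB, the 2M
PAMT 8 KiB and the 1G PAMT 4 KiB.  The 4K PAMT dominates, and on its
own it is exactly 1/256th of the TDMR.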

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate the three PAMTs of the TDMR in one contiguous chunk to
    minimize the total number of reserved areas consumed for PAMTs.
  - Try to allocate the PAMTs from the local node of the TDMR first for
    better NUMA locality.
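
The first policy can be sketched as carving one contiguous chunk into
three back-to-back PAMTs.  This is a minimal userspace illustration
under stated assumptions, not the code from the patch: the struct and
function names are hypothetical, the 4K/2M/1G ordering inside the chunk
is an assumption, and the NUMA-local allocation itself is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the three PAMTs inside one contiguous chunk. */
struct pamt_layout {
	uint64_t base_4k;
	uint64_t base_2m;
	uint64_t base_1g;
	uint64_t total_size;
};

/* Place the 4K, 2M and 1G PAMTs back to back, starting at chunk_base. */
static struct pamt_layout pamt_carve(uint64_t chunk_base, uint64_t size_4k,
				     uint64_t size_2m, uint64_t size_1g)
{
	struct pamt_layout l;

	/* This sketch expects each PAMT size to be a multiple of 4K. */
	assert(((size_4k | size_2m | size_1g) & 0xfffULL) == 0);

	l.base_4k = chunk_base;
	l.base_2m = l.base_4k + size_4k;
	l.base_1g = l.base_2m + size_2m;
	l.total_size = size_4k + size_2m + size_1g;
	return l;
}
```

Because the three PAMTs form a single range, a chunk that lands inside
a TDMR needs only one reserved area there rather than three.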

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.  This helps answer the eternal "where did
all my memory go?" questions.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v11 -> v12:
 - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill)
 - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill)
 - Changed tdmr_get_pamt() to return base and size instead of base_pfn
   and npages and related code directly (Dave).
 - Simplified PAMT kb counting. (Dave)
 - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave)

v10 -> v11:
 - No update

v9 -> v10:
 - Removed code change in disable_tdx_module() as it doesn't exist
   anymore.

v8 -> v9:
 - Added TDX_PS_NR macro instead of open-coding (Dave).
 - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
 - Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
 - Added Dave's Reviewed-by.

v7 -> v8: (Dave)
 - Changelog:
  - Added a sentence to state PAMT allocation will be improved.
  - Others suggested by Dave.
 - Moved 'nid' of 'struct tdx_memblock' to this patch.
 - Improved comments around tdmr_get_nid().
 - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Changes due to using macros instead of 'enum' for TDX supported page
   sizes.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
 - Improved comment around tdmr_get_nid() (Dave).
 - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
   into PAMTs for 4K/2M/1G (Dave).
 - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).

v3 -> v5 (no feedback on v4):
 - Used memblock to get the NUMA node for given TDMR.
 - Removed tdmr_get_pamt_sz() helper and open-coded it instead.
 - Changed to use 'switch .. case..' for each TDX supported page size in
   tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
 - Added printing out memory used for PAMT allocation when TDX module is
   initialized successfully.
 - Explained downside of alloc_contig_pages() in changelog.
 - Addressed other minor comments.

---
 arch/x86/Kconfig            |   1 +
 arch/x86/include/asm/tdx.h  |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   1 +
 4 files changed, 213 insertions(+), 5 deletions(-)

Message-ID: <85ea233226ec7a05e8c5627a499e97ea4cbd6950.1687784645.git.kai.huang@intel.com>
From: Kai Huang <kai.huang@intel.com>
Date: Tue, 27 Jun 2023 02:12:42 +1200
Subject: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
