[v10,00/16] TDX host kernel support

Message ID: cover.1678111292.git.kai.huang@intel.com
Series: TDX host kernel support

Message

Huang, Kai March 6, 2023, 2:13 p.m. UTC
Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
host and certain physical attacks.  TDX specs are available in [1].

This series provides the initial support to enable TDX with minimal code
to allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately [2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed [3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionalities, and doesn't aim
to resolve all things perfectly.  For example, memory hotplug is handled
in a simple way (please refer to the "Kernel policy on TDX memory" and
"Memory hotplug" sections below).

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

TDX module metadata allocation currently just uses alloc_contig_pages()
to allocate a large chunk at runtime, thus it can fail.  It is imperfect
now but _will_ be improved in the future.

Also, the patch to add the new kernel command line tdx="force" isn't
included in this initial version, as Dave suggested it isn't mandatory.
But I _will_ add one once this initial version gets merged.

All other optimizations will be posted as follow-up once this initial
TDX support is upstreamed.

Hi Dave, Peter, Thomas, Dan (and Intel reviewers),

The environment to test the new LP.INIT SEAMCALL behaviour isn't ready
yet, thus I haven't tested the new behaviour.  Instead, I tested with
all cpus online when initializing the TDX module.  The CPU hotplug path
isn't really tested, although I did some basic testing: I can offline
some cpus after module initialization, online them again, and LP.INIT
was skipped successfully for them.

However, I believe there should be no issue when the new module is
ready, and I will test and report back then.

I would appreciate it if folks could review this presumptive series
anyway, and I would appreciate Reviewed-by or Acked-by tags if the
patches look good to you.

----- Changelog history: ------

- v9 -> v10:

 - Changed the per-cpu initialization handling:
   - Gave up "ensuring all online cpus are TDX-runnable when the TDX
     module is initialized"; instead, just provide two basic functions,
     tdx_enable() and tdx_cpu_enable(), and let the user of TDX make
     sure tdx_cpu_enable() has been done successfully on a cpu before
     using that cpu for TDX.
   - Thus, moved per-cpu initialization out of tdx_enable().  Now
     tdx_enable() just assumes VMXON and tdx_cpu_enable() have been done
     on all online cpus before calling it.
   - Merged the tdx_enable() skeleton patch and the per-cpu
     initialization patch together to tell a better story.
   - Moved the "SEAMCALL infrastructure" patch before the tdx_enable()
     patch.

 v9: https://lore.kernel.org/lkml/cover.1676286526.git.kai.huang@intel.com/

- v8 -> v9:

 - Added back patches to handle TDH.SYS.INIT and TDH.SYS.LP.INIT.
 - For other changes, please refer to the changelog history in
   individual patches.

 v8: https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com/

- v7 -> v8:

 - 200+ LOC removed (from 1800+ -> 1600+).
 - Removed patches to do TDH.SYS.INIT and TDH.SYS.LP.INIT
   (Dave/Peter/Thomas).
 - Removed patch to shut down TDX module (Sean).
 - For memory hotplug, changed to reject non-TDX memory in the memory
   notifier instead of in arch_add_memory() (Dan/David).
 - Simplified the "skeleton patch" as a result of removing the
   TDH.SYS.LP.INIT patch.
 - Refined the changelogs/comments for most of the patches (to tell a
   better story, remove silly comments, etc.) (Dave).
 - Added new 'struct tdmr_info_list' struct, and changed all TDMR related
   patches to use it (Dave).
 - Effectively merged the patches "Reserve TDX module global KeyID" and
   "Configure TDX module with TDMRs and global KeyID", and removed the
   static variable 'tdx_global_keyid', following Dave's suggestion on
   making tdx_sysinfo a local variable.
 - For detailed changes please see individual patch changelog history.

 v7: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v6 -> v7:
  - Added memory hotplug support.
  - Changed when the list of "TDX-usable" memory regions is chosen: from
    kernel boot time to TDX module initialization time.
  - Addressed comments received in previous versions. (Andi/Dave).
  - Improved the commit message and comments of the kexec() support
    patch, and of the patch that handles returning PAMTs back to the
    kernel when TDX module initialization fails.  Please also see the
    "kexec()" section below.
  - Changed the documentation patch accordingly.
  - For all others please see individual patch changelog history.

 v6: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v5 -> v6:

  - Removed ACPI CPU/memory hotplug patches. (Intel internal discussion)
  - Removed patch to disable driver-managed memory hotplug (Intel
    internal discussion).
  - Added one patch to introduce an enum type for TDX supported page
    size levels to replace the hard-coded values in TDX guest code
    (Dave).
  - Added one patch to make TDX depend on X2APIC being enabled (Dave).
  - Added one patch to build all boot-time present memory regions as TDX
    memory during kernel boot.
  - Added Reviewed-by from others to some patches.
  - For all others please see individual patch changelog history.

 v5: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v4 -> v5:

  This is essentially a resend of v4.  Sorry, I forgot to consult
  get_maintainer.pl when sending out v4, so I missed the linux-acpi and
  linux-mm mailing lists and the relevant people for 4 new patches.

  There are also very minor code and commit message updates since v4:

  - Rebased to latest tip/x86/tdx.
  - Fixed a checkpatch issue that I missed in v4.
  - Removed an obsolete comment that I missed in patch 6.
  - Very minor update to the commit message of patch 12.

  For other changes to individual patches since v3, please refer to the
  changelog history of individual patches (I just used v3 -> v5 since
  there's basically no code change in v4).

 v4: https://lore.kernel.org/lkml/98c84c31d8f062a0b50a69ef4d3188bc259f2af2.1654025431.git.kai.huang@intel.com/T/

- v3 -> v4 (addressed Dave's comments and comments from others):

 - Simplified SEAMRR and TDX KeyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver-managed memory
   hotplug.
 - Removed tdx_detect(); use a single tdx_init() instead.
 - Removed detecting TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the patch adding a boot-time command line option to disable
   TDX.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

 v3: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v2 -> v3:

 - Addressed comments from Isaku:
  - Fixed a memory leak and an unnecessary function argument in the
    patch to configure the key for the global KeyID (patch 17).
  - Slightly enhanced the patch to get TDX module and CMR information
    (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add the TDX host kernel support
   materials to Documentation/x86/tdx.rst together with the TDX guest
   stuff, instead of using a standalone file (patch 21).
 - Very minor improvements to commit messages.

 v2: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues related to finding all RAM memory regions based
    on e820.
  - Minor improvements to comments and commit messages.

 v1: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in the new
isolated region as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
as TDX private KeyIDs, which are only accessible within the SEAM mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which use a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to the TDX module's.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.
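
To make that interface concrete, the series wraps SEAMCALL in an
assembly helper (arch/x86/virt/vmx/tdx/seamcall.S in the diffstat
below) behind roughly the following C prototype, reusing the kernel's
existing struct tdx_module_output for the output registers.  This is an
illustrative sketch, not the exact code:

        /*
         * Low-level SEAMCALL wrapper (sketch): @fn is the SEAMCALL leaf
         * function ID (passed in RAX), @rcx/@rdx/@r8/@r9 are the
         * leaf-specific inputs, and @out, if non-NULL, receives the
         * leaf-specific register outputs.  Returns the completion
         * status of the SEAMCALL from RAX.
         */
        u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
                       struct tdx_module_output *out);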

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter 13, "Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function tdx_enable() to allow the caller to initialize
TDX at runtime:

        if (tdx_enable())
                goto no_tdx;
        /* TDX is ready to create TD guests. */

This approach has the following pros:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is only consumed
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel.

3) It is more flexible to support "TDX module runtime update" (not in
this series).  After updating to a new module at runtime, the kernel
needs to go through the initialization process again.

2. CPU hotplug

The TDX module requires that the per-cpu initialization SEAMCALL
(TDH.SYS.LP.INIT) be done on a cpu before any other SEAMCALLs can be
made on that cpu, including those involved in the module initialization.

The kernel provides tdx_cpu_enable() to let the user of TDX do this
when it wants to use a new cpu for TDX tasks.
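
To illustrate how the two functions fit together, below is a rough
sketch (not the actual KVM code) of how a TDX user could bring up the
module, assuming it has already done VMXON on all online cpus.
enable_tdx() and do_tdx_cpu_enable() are made-up names, and error
handling is simplified:

        /* Runs on each cpu via SMP function call, with IRQs disabled. */
        static void do_tdx_cpu_enable(void *failed)
        {
                /* The caller must have done VMXON on this cpu already. */
                if (tdx_cpu_enable())
                        atomic_inc((atomic_t *)failed);
        }

        static int enable_tdx(void)
        {
                atomic_t failed = ATOMIC_INIT(0);
                int ret = -EINVAL;

                cpus_read_lock();
                /* Per-cpu initialization (TDH.SYS.LP.INIT) first ... */
                on_each_cpu(do_tdx_cpu_enable, &failed, 1);
                /* ... then the one-time module initialization. */
                if (!atomic_read(&failed))
                        ret = tdx_enable();
                cpus_read_unlock();

                return ret;
        }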

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
should never support hotpluggable CPU devices and/or deliver ACPI CPU
hotplug events to the kernel.  This series doesn't handle physical
(ACPI) CPU hotplug at all but depends on the BIOS to behave correctly.

Note TDX works with logical CPU online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as
usable for TDX private memory.

The initial support for TDX guests will only allocate TDX private
memory from the global page allocator.  This series chooses to designate
_all_ system RAM in the core-mm at the time of initializing the TDX
module as TDX memory, to guarantee all pages in the page allocator are
TDX pages.
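
As an illustration of that policy, the kernel can walk all memblock
memory regions present at module initialization time and record each
one as TDX memory.  This is a sketch only; add_tdx_memblock() is a
made-up helper standing in for the series' actual bookkeeping:

        static int build_tdx_memlist(struct list_head *tmb_list)
        {
                unsigned long start_pfn, end_pfn;
                int i, ret;

                for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn,
                                       &end_pfn, NULL) {
                        /*
                         * Assumption of this sketch: the first 1MB may
                         * not be convertible memory, so don't treat it
                         * as TDX memory.
                         */
                        start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
                        if (start_pfn >= end_pfn)
                                continue;

                        /* Record [start_pfn, end_pfn) as TDX memory. */
                        ret = add_tdx_memblock(tmb_list, start_pfn,
                                               end_pfn);
                        if (ret)
                                return ret;
                }

                return 0;
        }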

4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed for the module's
runtime.  No more "TDX-usable" memory can be added to the TDX module
after that.

To achieve the above guarantee that all pages in the page allocator are
TDX pages, this series simply chooses to reject any non-TDX-usable
memory in memory hotplug.
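
A sketch of that rejection, done via a memory notifier, is below;
is_tdx_memory() is a hypothetical helper that checks whether the
hot-added range is covered by the memory regions handed to the TDX
module:

        static int tdx_memory_notifier(struct notifier_block *nb,
                                       unsigned long action, void *v)
        {
                struct memory_notify *mn = v;

                if (action != MEM_GOING_ONLINE)
                        return NOTIFY_OK;

                /*
                 * Reject onlining any range that is not known "TDX
                 * memory", so the page allocator never hands out pages
                 * the TDX module cannot use.
                 */
                return is_tdx_memory(mn->start_pfn,
                                     mn->start_pfn + mn->nr_pages) ?
                                NOTIFY_OK : NOTIFY_BAD;
        }

        static struct notifier_block tdx_memory_nb = {
                .notifier_call = tdx_memory_notifier,
        };

The notifier would be registered during boot-time TDX detection via
register_memory_notifier(&tdx_memory_nb).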

This _will_ be enhanced in the future after the first submission.

A better solution, suggested by Kirill, is similar to the per-node
memory encryption flag series [4].  We can allow adding/onlining non-TDX
memory to separate NUMA nodes so that both "TDX-capable" nodes and
non-"TDX-capable" nodes can co-exist.  The new TDX flag can be exposed
to userspace via sysfs so userspace can bind TDX guests to "TDX-capable"
nodes via NUMA ABIs.

5. Physical Memory Hotplug

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal but depends on the BIOS to behave correctly.

6. Kexec()

There are two problems in terms of using kexec() to boot to a new kernel
when the old kernel has enabled TDX: 1) part of the memory pages are
still TDX private pages; 2) there might be dirty cachelines associated
with TDX private pages.

The first problem doesn't matter.  KeyID 0 doesn't have an integrity
check.  Even if the new kernel wants to use a non-zero KeyID, it needs
to convert the memory to that KeyID first, and such conversion works
from any KeyID.

However, the old kernel needs to guarantee there are no dirty cachelines
left behind before booting into the new kernel, to avoid silent
corruption from later cacheline writeback (Intel hardware doesn't
guarantee cache coherency across different KeyIDs).

This series just uses wbinvd() to flush the cache in stop_this_cpu(),
following AMD SME.
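
A minimal sketch of that change (the exact condition in the series may
differ; platform_tdx_enabled() is the series' TDX detection helper):

        /* In stop_this_cpu(), before halting the cpu (sketch): */

        /*
         * Flush the cache if memory encryption may have left dirty
         * cachelines behind: AMD SME (existing behaviour) or Intel TDX
         * (this series).  Cache coherency is not guaranteed across
         * KeyIDs, so dirty lines must not survive into the new kernel.
         */
        if (this_cpu_has(X86_FEATURE_SME) || platform_tdx_enabled())
                native_wbinvd();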



Kai Huang (16):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  x86/virt/tdx: Add SEAMCALL infrastructure
  x86/virt/tdx: Add skeleton to enable TDX on demand
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Designate reserved areas for all TDMRs
  x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  186 ++++-
 arch/x86/Kconfig                 |   15 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/tdx/tdx.c          |    6 +-
 arch/x86/include/asm/msr-index.h |    3 +
 arch/x86/include/asm/tdx.h       |   21 +
 arch/x86/kernel/process.c        |    7 +-
 arch/x86/kernel/setup.c          |    2 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1324 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  150 ++++
 arch/x86/virt/vmx/tdx/tdxcall.S  |   19 +-
 15 files changed, 1776 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 1e70c680375aa33cca97bff0bca68c0f82f5023c

Comments

Isaku Yamahata March 8, 2023, 1:11 a.m. UTC | #1
On Tue, Mar 07, 2023 at 03:13:45AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
> host and certain physical attacks.  TDX specs are available in [1].
> 
> This series provides the initial support to enable TDX with minimal code
> to allow KVM to create and run TDX guests.  KVM support for TDX is being
> developed separately [2].  A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed [3].  KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.
> 
> This series doesn't aim to support all functionalities, and doesn't aim
> to resolve all things perfectly.  For example, memory hotplug is handled
> in a simple way (please refer to the "Kernel policy on TDX memory" and
> "Memory hotplug" sections below).
> 
> (For memory hotplug, sorry for broadcasting widely, but I cc'ed
> linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> help to provide comments.)
> 
> TDX module metadata allocation currently just uses alloc_contig_pages()
> to allocate a large chunk at runtime, thus it can fail.  It is imperfect
> now but _will_ be improved in the future.
> 
> Also, the patch to add the new kernel command line tdx="force" isn't
> included in this initial version, as Dave suggested it isn't mandatory.
> But I _will_ add one once this initial version gets merged.
> 
> All other optimizations will be posted as follow-up once this initial
> TDX support is upstreamed.
> 
> Hi Dave, Peter, Thomas, Dan (and Intel reviewers),
> 
> The environment to test the new LP.INIT SEAMCALL behaviour isn't ready
> yet, thus I haven't tested the new behaviour.  Instead, I tested with
> all cpus online when initializing the TDX module.  The CPU hotplug path
> isn't really tested, although I did some basic testing: I can offline
> some cpus after module initialization, online them again, and LP.INIT
> was skipped successfully for them.
> 
> However, I believe there should be no issue when the new module is
> ready, and I will test and report back then.
> 
> I would appreciate it if folks could review this presumptive series
> anyway, and I would appreciate Reviewed-by or Acked-by tags if the
> patches look good to you.
> 
> ----- Changelog history: ------
> 
> - v9 -> v10:
> 
>  - Changed the per-cpu initialization handling:
>    - Gave up "ensuring all online cpus are TDX-runnable when the TDX
>      module is initialized"; instead, just provide two basic functions,
>      tdx_enable() and tdx_cpu_enable(), and let the user of TDX make
>      sure tdx_cpu_enable() has been done successfully on a cpu before
>      using that cpu for TDX.
>    - Thus, moved per-cpu initialization out of tdx_enable().  Now
>      tdx_enable() just assumes VMXON and tdx_cpu_enable() have been done
>      on all online cpus before calling it.
>    - Merged the tdx_enable() skeleton patch and the per-cpu
>      initialization patch together to tell a better story.
>    - Moved the "SEAMCALL infrastructure" patch before the tdx_enable()
>      patch.
> 
>  v9: https://lore.kernel.org/lkml/cover.1676286526.git.kai.huang@intel.com/
> 
> - v8 -> v9:
> 
>  - Added back patches to handle TDH.SYS.INIT and TDH.SYS.LP.INIT.
>  - For other changes, please refer to the changelog history in
>    individual patches.

I've rebased my TDX KVM patches on top of this patch series and updated
the initialization.  I tested with all LPs online with the existing TDX
module, and did cpu online/offline while a TD was running.

Tested-by: Isaku Yamahata <isaku.yamahata@intel.com>
David Hildenbrand March 16, 2023, 12:35 p.m. UTC | #2
On 06.03.23 15:13, Kai Huang wrote:
> Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
> host and certain physical attacks.  TDX specs are available in [1].

I'm afraid there is no [1]; it probably got lost while resending :)

> 
> This series provides the initial support to enable TDX with minimal code
> to allow KVM to create and run TDX guests.  KVM support for TDX is being
> developed separately [2].  A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed [3].  KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.

Same with [2].

> 
> This series doesn't aim to support all functionalities, and doesn't aim
> to resolve all things perfectly.  For example, memory hotplug is handled
> in a simple way (please refer to the "Kernel policy on TDX memory" and
> "Memory hotplug" sections below).
> 
> (For memory hotplug, sorry for broadcasting widely, but I cc'ed
> linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> help to provide comments.)
> 
> TDX module metadata allocation currently just uses alloc_contig_pages()
> to allocate a large chunk at runtime, thus it can fail.  It is imperfect
> now but _will_ be improved in the future.

Good enough for now I guess. Reserving it via memblock might be better, 
though.

> 
> Also, the patch to add the new kernel command line tdx="force" isn't
> included in this initial version, as Dave suggested it isn't mandatory.
> But I _will_ add one once this initial version gets merged.

What would be the main purpose of that option?

> 
> All other optimizations will be posted as follow-up once this initial
> TDX support is upstreamed.
> 


[...]

> == Background ==
> 
> TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
> and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
> A CPU-attested software module called 'the TDX module' runs in the new
> isolated region as a trusted hypervisor to create/run protected VMs.
> 
> TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
> provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
> as TDX private KeyIDs, which are only accessible within the SEAM mode.
> 
> TDX is different from AMD SEV/SEV-ES/SEV-SNP, which use a dedicated
> secure processor to provide crypto-protection.  The firmware running on
> that secure processor plays a role similar to the TDX module's.
> 
> The host kernel communicates with SEAM software via a new SEAMCALL
> instruction.  This is conceptually similar to a guest->host hypercall,
> except it is made from the host to SEAM software instead.
> 
> Before being able to manage TD guests, the TDX module must be loaded
> and properly initialized.  This series assumes the TDX module is loaded
> by BIOS before the kernel boots.
> 
> How to initialize the TDX module is described in the TDX module 1.0
> specification, chapter 13, "Intel TDX Module Lifecycle: Enumeration,
> Initialization and Shutdown".
> 
> == Design Considerations ==
> 
> 1. Initialize the TDX module at runtime
> 
> There are basically two ways the TDX module could be initialized: either
> in early boot, or at runtime before the first TDX guest is run.  This
> series implements the runtime initialization.
> 
> This series adds a function tdx_enable() to allow the caller to initialize
> TDX at runtime:
> 
>          if (tdx_enable())
>                  goto no_tdx;
>          /* TDX is ready to create TD guests. */
> 
> This approach has the following pros:
> 
> 1) Initializing the TDX module requires reserving ~1/256th of system RAM
> as metadata.  Enabling TDX on demand means this memory is only consumed
> when TDX is truly needed (i.e. when KVM wants to create TD guests).

Let's be clear: nobody is going to run encrypted VMs "out of the blue".

You can expect a certain hypervisor setup to be required, for example, 
enabling it on the cmdline and then allocating that metadata from 
memblock during boot.

IIRC s390x handles it similarly with protected VMs and required metadata.

> 
> 2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
> been done).  So far, KVM is the only user of TDX, and it already handles
> VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
> kernel.
> 
> 3) It is more flexible to support "TDX module runtime update" (not in
> this series).  After updating to a new module at runtime, the kernel
> needs to go through the initialization process again.
> 
> 2. CPU hotplug
> 
> The TDX module requires that the per-cpu initialization SEAMCALL
> (TDH.SYS.LP.INIT) be done on a cpu before any other SEAMCALLs can be
> made on that cpu, including those involved in the module initialization.
> 
> The kernel provides tdx_cpu_enable() to let the user of TDX do this
> when it wants to use a new cpu for TDX tasks.
> 
> TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
> should never support hotpluggable CPU devices and/or deliver ACPI CPU
> hotplug events to the kernel.  This series doesn't handle physical
> (ACPI) CPU hotplug at all but depends on the BIOS to behave correctly.
> 
> Note TDX works with logical CPU online/offline, thus this series still
> allows logical CPU online/offline.
> 
> 3. Kernel policy on TDX memory
> 
> The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
> indicate which memory regions are TDX-capable.  The TDX architecture
> allows the VMM to designate specific convertible memory regions as
> usable for TDX private memory.
> 
> The initial support for TDX guests will only allocate TDX private
> memory from the global page allocator.  This series chooses to designate
> _all_ system RAM in the core-mm at the time of initializing the TDX
> module as TDX memory, to guarantee all pages in the page allocator are
> TDX pages.
> 
> 4. Memory Hotplug
> 
> After the kernel passes all "TDX-usable" memory regions to the TDX
> module, the set of "TDX-usable" memory regions is fixed for the
> module's runtime.  No more "TDX-usable" memory can be added to the TDX
> module after that.
> 
> To achieve the above guarantee that all pages in the page allocator
> are TDX pages, this series simply chooses to reject any non-TDX-usable
> memory in memory hotplug.
> 
> This _will_ be enhanced in the future after the first submission.

What's the primary reason to enhance that? Are there reasonable use
cases? Why would we expect to have other (!TDX capable) memory in the
system?

> 
> A better solution, suggested by Kirill, is similar to the per-node
> memory encryption flag series [4].  We can allow adding/onlining
> non-TDX memory to separate NUMA nodes so that both "TDX-capable" nodes
> and non-"TDX-capable" nodes can co-exist.  The new TDX flag can be
> exposed to userspace via sysfs so userspace can bind TDX guests to
> "TDX-capable" nodes via NUMA ABIs.
> 
> 5. Physical Memory Hotplug
> 
> Note TDX assumes convertible memory is always physically present during
> the machine's runtime.  A non-buggy BIOS should never support
> hot-removal of any convertible memory.  This implementation doesn't
> handle ACPI memory removal but depends on the BIOS to behave correctly.
Huang, Kai March 16, 2023, 10:06 p.m. UTC | #3
On Thu, 2023-03-16 at 13:35 +0100, David Hildenbrand wrote:
> On 06.03.23 15:13, Kai Huang wrote:
> > Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
> > host and certain physical attacks.  TDX specs are available in [1].
> 
> I'm afraid there is no [1]; it probably got lost while resending :)
> 
> > 
> > This series provides the initial support to enable TDX with minimal code
> > to allow KVM to create and run TDX guests.  KVM support for TDX is being
> > developed separately [2].  A new "userspace inaccessible memfd" approach
> > to support TDX private memory is also being developed [3].  KVM will
> > only support the new "userspace inaccessible memfd" as TDX guest memory.
> 
> Same with [2].

Hi David,

Thanks for your feedback!

Oh sorry, yes, indeed they were stripped unintentionally when I was
updating the cover letter.  I've added them here for your reference:

[1]: TDX specs
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

[2]: KVM TDX support
https://lore.kernel.org/lkml/cover.1678643051.git.isaku.yamahata@intel.com/

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/

> 
> > 
> > This series doesn't aim to support all functionalities, and doesn't aim
> > to resolve all things perfectly.  For example, memory hotplug is handled
> > in a simple way (please refer to the "Kernel policy on TDX memory" and
> > "Memory hotplug" sections below).
> > 
> > (For memory hotplug, sorry for broadcasting widely, but I cc'ed
> > linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> > help to provide comments.)
> > 
> > TDX module metadata allocation currently just uses alloc_contig_pages()
> > to allocate a large chunk at runtime, thus it can fail.  It is imperfect
> > now but _will_ be improved in the future.
> 
> Good enough for now I guess. Reserving it via memblock might be better, 
> though.
> 
> > 
> > Also, the patch to add the new kernel command line tdx="force" isn't
> > included in this initial version, as Dave suggested it isn't mandatory.
> > But I _will_ add one once this initial version gets merged.
> 
> What would be the main purpose of that option?

Initializing the TDX module consumes a non-trivial amount of memory,
which is given to the TDX module as metadata to track page status, etc.
Currently, the KVM maintainers want to initialize TDX at KVM module
loading time.  This basically means TDX will get enabled by default even
when people don't want to use it.  So Peter wanted to add a kernel boot
parameter to disable TDX for everyone.

> 
> > 
> > All other optimizations will be posted as follow-up once this initial
> > TDX support is upstreamed.
> > 
> 
> 
> [...]
> 
> > == Background ==
> > 
> > TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
> > and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
> > A CPU-attested software module called 'the TDX module' runs in the new
> > isolated region as a trusted hypervisor to create/run protected VMs.
> > 
> > TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
> > provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
> > as TDX private KeyIDs, which are only accessible within the SEAM mode.
> > 
> > TDX is different from AMD SEV/SEV-ES/SEV-SNP, which use a dedicated
> > secure processor to provide crypto-protection.  The firmware running on
> > that secure processor plays a role similar to the TDX module's.
> > 
> > The host kernel communicates with SEAM software via a new SEAMCALL
> > instruction.  This is conceptually similar to a guest->host hypercall,
> > except it is made from the host to SEAM software instead.
> > 
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized.  This series assumes the TDX module is loaded
> > by BIOS before the kernel boots.
> > 
> > How to initialize the TDX module is described in the TDX module 1.0
> > specification, chapter 13, "Intel TDX Module Lifecycle: Enumeration,
> > Initialization and Shutdown".
> > 
> > == Design Considerations ==
> > 
> > 1. Initialize the TDX module at runtime
> > 
> > There are basically two ways the TDX module could be initialized: either
> > in early boot, or at runtime before the first TDX guest is run.  This
> > series implements the runtime initialization.
> > 
> > This series adds a function tdx_enable() to allow the caller to initialize
> > TDX at runtime:
> > 
> >          if (tdx_enable())
> >                  goto no_tdx;
> >          /* TDX is ready to create TD guests. */
> > 
> > This approach has the following pros:
> > 
> > 1) Initializing the TDX module requires reserving ~1/256th of system RAM
> > as metadata.  Enabling TDX on demand means this memory is only consumed
> > when TDX is truly needed (i.e. when KVM wants to create TD guests).
> 
> Let's be clear: nobody is going to run encrypted VMs "out of the blue".
> 
> You can expect a certain hypervisor setup to be required, for example, 
> enabling it on the cmdline and then allocating that metadata from 
> memblock during boot.

Yes, KVM will also have a parameter to specifically enable TDX.

> 
> IIRC s390x handles it similarly with protected VMs and required metadata.
> 
> > 
> > 2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
> > been done).  So far, KVM is the only user of TDX, and it already handles
> > VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
> > kernel.
> > 
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series).  After updating to a new module at runtime, the kernel
> > needs to go through the initialization process again.
> > 
> > 2. CPU hotplug
> > 
> > The TDX module requires that the per-cpu initialization SEAMCALL
> > (TDH.SYS.LP.INIT) be done on a cpu before any other SEAMCALLs can be
> > made on that cpu, including those involved in the module initialization.
> > 
> > The kernel provides tdx_cpu_enable() to let the user of TDX do this
> > when it wants to use a new cpu for TDX tasks.
> > 
> > TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
> > should never support hotpluggable CPU devices and/or deliver ACPI CPU
> > hotplug events to the kernel.  This series doesn't handle physical
> > (ACPI) CPU hotplug at all but depends on the BIOS to behave correctly.
> > 
> > Note TDX works with logical CPU online/offline, thus this series still
> > allows logical CPU online/offline.
> > 
> > 3. Kernel policy on TDX memory
> > 
> > The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
> > indicate which memory regions are TDX-capable.  The TDX architecture
> > allows the VMM to designate specific convertible memory regions as
> > usable for TDX private memory.
> > 
> > The initial support for TDX guests will only allocate TDX private
> > memory from the global page allocator.  This series chooses to designate
> > _all_ system RAM in the core-mm at the time of initializing the TDX
> > module as TDX memory, to guarantee all pages in the page allocator are
> > TDX pages.
> > 
> > 4. Memory Hotplug
> > 
> > After the kernel passes all "TDX-usable" memory regions to the TDX
> > module, the set of "TDX-usable" memory regions is fixed for the
> > module's runtime.  No more "TDX-usable" memory can be added to the TDX
> > module after that.
> > 
> > To achieve the above guarantee that all pages in the page allocator
> > are TDX pages, this series simply chooses to reject any non-TDX-usable
> > memory in memory hotplug.
> > 
> > This _will_ be enhanced in the future after the first submission.
> 
> What's the primary reason to enhance that? Are there reasonable use
> cases? Why would we expect to have other (!TDX capable) memory in the
> system?

Basically Kirill preferred this.  Please see the below paragraph from
my original cover letter.

But there has been no consensus on whether we should do it, especially
within the community.  I probably shouldn't have used the word _will_
here (I also kind of forgot to keep this section up to date).

I think I'll just remove this and the below paragraph entirely, or
adjust the wording to say it is perhaps an enhancement we can do in the
future.

> 
> > 
> > A better solution, suggested by Kirill, is similar to the per-node
> > memory encryption flag series [4].  We can allow adding/onlining
> > non-TDX memory to separate NUMA nodes so that both "TDX-capable" nodes
> > and non-"TDX-capable" nodes can co-exist.  The new TDX flag can be
> > exposed to userspace via sysfs so userspace can bind TDX guests to
> > "TDX-capable" nodes via NUMA ABIs.

Also [4] was stripped:

[4]: per-node memory encryption flag
https://lore.kernel.org/linux-mm/20221007155323.ue4cdthkilfy4lbd@box.shutemov.name/t/

> > 
> > 5. Physical Memory Hotplug
> > 
> > Note TDX assumes convertible memory is always physically present during
> > the machine's runtime.  A non-buggy BIOS should never support
> > hot-removal of any convertible memory.  This implementation doesn't
> > handle ACPI memory removal but depends on the BIOS to behave correctly.
> 
> -- 
> Thanks,
> 
> David / dhildenb
>