diff mbox series

[8/9] x86/virt/tdx: Exclude memory region hole within CMR as TDMR's reserved area

Message ID cfbed1139887416b6fe0d130883dbe210e97d598.1718538552.git.kai.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX host: metadata reading tweaks, bug fix and info dump | expand

Commit Message

Huang, Kai June 16, 2024, 12:01 p.m. UTC
A TDX module initialization failure was reported on a Emerald Rapids
platform:

  virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
  virt/tdx: module initialization failed (-28)

As a step of initializing the TDX module, the kernel tells the TDX
module all the "TDX-usable memory regions" via a set of TDX architecture
defined structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB
aligned and in 1GB granularity, and all "non-TDX-usable memory holes" in
a given TDMR must be marked as a "reserved area".  Each TDMR only
supports a maximum number of reserved areas reported by the TDX module.

As shown above, the root cause of this failure is when the kernel tries
to construct a TDMR to cover address range [0x0, 0x80000000), there
are too many memory holes within that range and the number of memory
holes exceeds the maximum number of reserved areas.

The E820 table of that platform (see [1] below) reflects this: the
number of memory holes among e820 "usable" entries exceeds 16, which is
the maximum number of reserved areas TDX module supports in practice.

=== Fix ===

There are two options to fix this: 1) put less memory holes as "reserved
area" when constructing a TDMR; 2) reduce the TDMR's size to cover less
memory regions, thus less memory holes.

Option 1) is possible, and in fact is easier and preferable:

TDX actually has a concept of "Convertible Memory Regions" (CMRs).  TDX
reports a list of CMRs that meet TDX's security requirements on memory.
TDX requires all the "TDX-usable memory regions" that the kernel passes
to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs",
must be convertible memory.

In other words, if a memory hole is indeed CMR, then it's not mandatory
for the kernel to add it to the reserved areas.  The number of consumed
reserved areas can be reduced if the kernel doesn't add those memory
holes as reserved area.  Note this doesn't have security impact because
the kernel is out of TDX's TCB anyway.

This is feasible because in practice the CMRs just reflect the nature of
whether the RAM can indeed be used by TDX, thus each CMR tends to be a
large range w/o being split into small areas, e.g., in the way the e820
table does to contain a lot "ACPI *" entries.  [2] below shows the CMRs
reported on the problematic platform (using the off-tree TDX code).

So for this particular module initialization failure, the memory holes
that are within [0x0, 0x80000000) are mostly indeed CMR.  By not adding
them to the reserved areas, the number of consumed reserved areas for
the TDMR [0x0, 0x80000000) can be dramatically reduced.

On the other hand, although option 2) is also theoretically feasible, it
requires more complicated logic to handle around splitting TDMR into
smaller ones.  E.g., today one memory region must be fully in one TDMR,
while splitting TDMR will result in each TDMR only covering part of some
memory region.  And this also increases the total number of TDMRs, which
also cannot exceed a maximum value that TDX module supports.

So, fix this issue by:

1) reading out the CMRs from the TDX module global metadata, and
2) passing the CMRs to the function construct_tdmrs(), and changing to
   not add a memory hole to the reserved areas when it is indeed CMR.

Also dump the CMRs in dmesg.  They are helpful when something goes wrong
around "constructing and passing the TDMRs to the TDX module to
configure it".

[1] BIOS-E820 table of the problematic platform:

  BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
  BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
  BIOS-e820: [mem 0x0000000000100000-0x000000005d168fff] usable
  BIOS-e820: [mem 0x000000005d169000-0x000000005d22afff] ACPI data
  BIOS-e820: [mem 0x000000005d22b000-0x000000005d3cefff] usable
  BIOS-e820: [mem 0x000000005d3cf000-0x000000005d469fff] reserved
  BIOS-e820: [mem 0x000000005d46a000-0x000000005e5b2fff] usable
  BIOS-e820: [mem 0x000000005e5b3000-0x000000005e5c2fff] reserved
  BIOS-e820: [mem 0x000000005e5c3000-0x000000005e5d2fff] usable
  BIOS-e820: [mem 0x000000005e5d3000-0x000000005e5e4fff] reserved
  BIOS-e820: [mem 0x000000005e5e5000-0x000000005eb57fff] usable
  BIOS-e820: [mem 0x000000005eb58000-0x0000000061357fff] ACPI NVS
  BIOS-e820: [mem 0x0000000061358000-0x000000006172afff] usable
  BIOS-e820: [mem 0x000000006172b000-0x0000000061794fff] ACPI data
  BIOS-e820: [mem 0x0000000061795000-0x00000000617fefff] usable
  BIOS-e820: [mem 0x00000000617ff000-0x0000000061912fff] ACPI data
  BIOS-e820: [mem 0x0000000061913000-0x0000000061998fff] usable
  BIOS-e820: [mem 0x0000000061999000-0x00000000619dffff] ACPI data
  BIOS-e820: [mem 0x00000000619e0000-0x00000000619e1fff] usable
  BIOS-e820: [mem 0x00000000619e2000-0x00000000619e9fff] reserved
  BIOS-e820: [mem 0x00000000619ea000-0x0000000061a26fff] usable
  BIOS-e820: [mem 0x0000000061a27000-0x0000000061baefff] ACPI data
  BIOS-e820: [mem 0x0000000061baf000-0x00000000623c2fff] usable
  BIOS-e820: [mem 0x00000000623c3000-0x0000000062471fff] reserved
  BIOS-e820: [mem 0x0000000062472000-0x0000000062823fff] usable
  BIOS-e820: [mem 0x0000000062824000-0x0000000063a24fff] reserved
  BIOS-e820: [mem 0x0000000063a25000-0x0000000063d57fff] usable
  BIOS-e820: [mem 0x0000000063d58000-0x0000000064157fff] reserved
  BIOS-e820: [mem 0x0000000064158000-0x0000000064158fff] usable
  BIOS-e820: [mem 0x0000000064159000-0x0000000064194fff] reserved
  BIOS-e820: [mem 0x0000000064195000-0x000000006e9cefff] usable
  BIOS-e820: [mem 0x000000006e9cf000-0x000000006eccefff] reserved
  BIOS-e820: [mem 0x000000006eccf000-0x000000006f6fefff] ACPI NVS
  BIOS-e820: [mem 0x000000006f6ff000-0x000000006f7fefff] ACPI data
  BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
  BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
  ......

[2] Convertible Memory Regions of the problematic platform:

  virt/tdx: CMR: [0x100000, 0x6f800000)
  virt/tdx: CMR: [0x100000000, 0x107a000000)
  virt/tdx: CMR: [0x1080000000, 0x207c000000)
  virt/tdx: CMR: [0x2080000000, 0x307c000000)
  virt/tdx: CMR: [0x3080000000, 0x407c000000)

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 149 ++++++++++++++++++++++++++++++++----
 arch/x86/virt/vmx/tdx/tdx.h |  13 ++++
 2 files changed, 146 insertions(+), 16 deletions(-)

Comments

Chao Gao June 17, 2024, 10:54 a.m. UTC | #1
>+/* Return whether a given region [start, end) is a sub-region of any CMR */
>+static bool is_cmr_subregion(struct tdx_sysinfo_cmr_info *cmr_info, u64 start,
>+			    u64 end)
>+{
>+	int i;
>+
>+	for (i = 0; i < cmr_info->num_cmrs; i++) {
>+		u64 cmr_base = cmr_info->cmr_base[i];
>+		u64 cmr_size = cmr_info->cmr_size[i];
>+
>+		if (start >= cmr_base && end <= (cmr_base + cmr_size))
>+			return true;
>+	}
>+
>+	return false;
>+}
>+
> /*
>  * Go through @tmb_list to find holes between memory areas.  If any of

The logic here is:
1. go through @tmb_list to find holes
2. skip a hole if it is in CMRs

I am wondering if the kernel can traverse CMRs directly to find holes. This
way, the new is_cmr_subregion() can be removed. And @tmb_list can be dropped
from a few functions e.g., tdmr_populate_rsvd_holes/areas/areas_all(). So, this
will simplify those functions a bit.

>  * those holes fall within @tdmr, set up a TDMR reserved area to cover
>@@ -835,7 +932,8 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
> static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
> 				    struct tdmr_info *tdmr,
> 				    int *rsvd_idx,
>-				    u16 max_reserved_per_tdmr)
>+				    u16 max_reserved_per_tdmr,
>+				    struct tdx_sysinfo_cmr_info *cmr_info)

Maybe this function can accept a pointer to tdx_sysinfo and remove
@max_reserved_per_tdmr and @cmr_info because they are both TDX metadata and
have only one possible combination for a given TDX module. Anyway, I don't have
a strong opinion on this.

> {
> 	struct tdx_memblock *tmb;
> 	u64 prev_end;
>@@ -864,10 +962,16 @@ static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
> 		 * Skip over memory areas that
> 		 * have already been dealt with.
> 		 */
>-		if (start <= prev_end) {
>-			prev_end = end;
>-			continue;
>-		}
>+		if (start <= prev_end)
>+			goto next_tmb;
>+
>+		/*
>+		 * Found the hole [prev_end, start) before this region.
>+		 * Skip the hole if it is within any CMR to reduce the
>+		 * consumption of reserved areas.
>+		 */
>+		if (is_cmr_subregion(cmr_info, prev_end, start))
>+			goto next_tmb;
> 
> 		/* Add the hole before this region */
> 		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
>@@ -876,11 +980,16 @@ static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
> 		if (ret)
> 			return ret;
> 
>+next_tmb:
> 		prev_end = end;
> 	}
> 
>-	/* Add the hole after the last region if it exists. */
>-	if (prev_end < tdmr_end(tdmr)) {
>+	/*
>+	 * Add the hole after the last region if it exists, but skip
>+	 * if it is within any CMR.
>+	 */
>+	if (prev_end < tdmr_end(tdmr) &&
>+			!is_cmr_subregion(cmr_info, prev_end, tdmr_end(tdmr))) {
> 		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> 				tdmr_end(tdmr) - prev_end,
> 				max_reserved_per_tdmr);
Huang, Kai June 18, 2024, 12:12 a.m. UTC | #2
On 17/06/2024 10:54 pm, Chao Gao wrote:
>> +/* Return whether a given region [start, end) is a sub-region of any CMR */
>> +static bool is_cmr_subregion(struct tdx_sysinfo_cmr_info *cmr_info, u64 start,
>> +			    u64 end)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < cmr_info->num_cmrs; i++) {
>> +		u64 cmr_base = cmr_info->cmr_base[i];
>> +		u64 cmr_size = cmr_info->cmr_size[i];
>> +
>> +		if (start >= cmr_base && end <= (cmr_base + cmr_size))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> /*
>>   * Go through @tmb_list to find holes between memory areas.  If any of
> 
> The logic here is:
> 1. go through @tmb_list to find holes
> 2. skip a hole if it is in CMRs
> 
> I am wondering if the kernel can traverse CMRs directly to find holes. This
> way, the new is_cmr_subregion() can be removed. And @tmb_list can be dropped
> from a few functions e.g., tdmr_populate_rsvd_holes/areas/areas_all(). So, this
> will simplify those functions a bit.

Traversing CMRs to find reserved areas for a given TDMR sounds good to 
me logically.  The whole construct_tdmrs() assumes all TDX memory blocks 
are fully covered by CMRs anyway.

I'll try this out (validation is a bit tricky because we cannot 
reproduce this issue internally).

> 
>>   * those holes fall within @tdmr, set up a TDMR reserved area to cover
>> @@ -835,7 +932,8 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
>> static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
>> 				    struct tdmr_info *tdmr,
>> 				    int *rsvd_idx,
>> -				    u16 max_reserved_per_tdmr)
>> +				    u16 max_reserved_per_tdmr,
>> +				    struct tdx_sysinfo_cmr_info *cmr_info)
> 
> Maybe this function can accept a pointer to tdx_sysinfo and remove
> @max_reserved_per_tdmr and @cmr_info because they are both TDX metadata and
> have only one possible combination for a given TDX module. Anyway, I don't have
> a strong opinion on this.

There are pros/cons of the two options.  Passing the @cmr_info and 
@max_reserved_per_tdmr makes this function more clear that it _exactly_ 
requires these two.  Since passing a single @tdx_sysinfo doesn't reduce 
the function arguments a lot (only one argument, and you have to get the 
two inside the function anyway), I would keep the current way unless I 
hear from something different from others.
Nikolay Borisov June 18, 2024, 3:10 p.m. UTC | #3
On 16.06.24 г. 15:01 ч., Kai Huang wrote:
> A TDX module initialization failure was reported on a Emerald Rapids
> platform:
> 
>    virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
>    virt/tdx: module initialization failed (-28)
> 
> As a step of initializing the TDX module, the kernel tells the TDX
> module all the "TDX-usable memory regions" via a set of TDX architecture
> defined structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB
> aligned and in 1GB granularity, and all "non-TDX-usable memory holes" in
> a given TDMR must be marked as a "reserved area".  Each TDMR only
> supports a maximum number of reserved areas reported by the TDX module.
> 
> As shown above, the root cause of this failure is when the kernel tries
> to construct a TDMR to cover address range [0x0, 0x80000000), there
> are too many memory holes within that range and the number of memory
> holes exceeds the maximum number of reserved areas.
> 
> The E820 table of that platform (see [1] below) reflects this: the
> number of memory holes among e820 "usable" entries exceeds 16, which is
> the maximum number of reserved areas TDX module supports in practice.
> 
> === Fix ===
> 
> There are two options to fix this: 1) put less memory holes as "reserved
> area" when constructing a TDMR; 2) reduce the TDMR's size to cover less
> memory regions, thus less memory holes.
> 
> Option 1) is possible, and in fact is easier and preferable:
> 
> TDX actually has a concept of "Convertible Memory Regions" (CMRs).  TDX
> reports a list of CMRs that meet TDX's security requirements on memory.
> TDX requires all the "TDX-usable memory regions" that the kernel passes
> to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs",
> must be convertible memory.
> 
> In other words, if a memory hole is indeed CMR, then it's not mandatory

So TDX requires all TDMR to be CMR, and CMR regions are reported by the 
BIOS, how did you arrive at the conclusion that if a hole is CMR there 
is no point in creating a TDMR for it?

> for the kernel to add it to the reserved areas.  The number of consumed
> reserved areas can be reduced if the kernel doesn't add those memory
> holes as reserved area.  Note this doesn't have security impact because
> the kernel is out of TDX's TCB anyway.
> 
> This is feasible because in practice the CMRs just reflect the nature of
> whether the RAM can indeed be used by TDX, thus each CMR tends to be a
> large range w/o being split into small areas, e.g., in the way the e820
> table does to contain a lot "ACPI *" entries.  [2] below shows the CMRs
> reported on the problematic platform (using the off-tree TDX code).
> 
> So for this particular module initialization failure, the memory holes
> that are within [0x0, 0x80000000) are mostly indeed CMR.  By not adding
> them to the reserved areas, the number of consumed reserved areas for
> the TDMR [0x0, 0x80000000) can be dramatically reduced.
> 
> On the other hand, although option 2) is also theoretically feasible, it
> requires more complicated logic to handle around splitting TDMR into
> smaller ones.  E.g., today one memory region must be fully in one TDMR,
> while splitting TDMR will result in each TDMR only covering part of some
> memory region.  And this also increases the total number of TDMRs, which
> also cannot exceed a maximum value that TDX module supports.
> 

<snip>

> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>   arch/x86/virt/vmx/tdx/tdx.c | 149 ++++++++++++++++++++++++++++++++----
>   arch/x86/virt/vmx/tdx/tdx.h |  13 ++++
>   2 files changed, 146 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index ced40e3b516e..88a0c8b788b7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -293,6 +293,10 @@ static int stbuf_read_sysmd_field(u64 field_id, void *stbuf, int offset,
>   	return 0;
>   }
>   
> +/* Wrapper to read one metadata field to u8/u16/u32/u64 */
> +#define stbuf_read_sysmd_single(_field_id, _pdata)	\
> +	stbuf_read_sysmd_field(_field_id, _pdata, 0, sizeof(typeof(*(_pdata))))

What value does adding yet another level of indirection bring here?

> +
>   struct field_mapping {
>   	u64 field_id;
>   	int offset;
> @@ -349,6 +353,76 @@ static int get_tdx_module_version(struct tdx_sysinfo_module_version *modver)
>   	return stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields), modver);
>   }
>   
> +/* Update the @cmr_info->num_cmrs to trim tail empty CMRs */
> +static void trim_empty_tail_cmrs(struct tdx_sysinfo_cmr_info *cmr_info)
> +{
> +	int i;
> +
> +	for (i = 0; i < cmr_info->num_cmrs; i++) {
> +		u64 cmr_base = cmr_info->cmr_base[i];
> +		u64 cmr_size = cmr_info->cmr_size[i];
> +
> +		if (!cmr_size) {
> +			WARN_ON_ONCE(cmr_base);
> +			break;
> +		}
> +
> +		/* TDX architecture: CMR must be 4KB aligned */
> +		WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) ||
> +				!PAGE_ALIGNED(cmr_size));
> +	}
> +
> +	cmr_info->num_cmrs = i;
> +}

That function is somewhat weird, on the one hand its name suggests it's 
doing some "optimisation" i.e removing empty cmrs, at the same time it 
will simply cap the number of CMRs until it meets the first empty CMR, 
what aif we have and will also WARN. In fact it could even crash the 
machine if panic_on_warn is enabled, furthermore the alignement checks 
suggest it's actually some sanity checking function. Furthermore if we 
have:"

ORDINARY_CMR,EMPTY_CMR,ORDINARY_CMR

(Is such a scenario even possible), in this case we'll ommit also the 
last ordinary cmr region?

> +
> +#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member)	\
> +	TD_SYSINFO_MAP(_field_id, struct tdx_sysinfo_cmr_info, _member)

nit: Again, no real value in introducing yet another level of 
indirection in this case.

> +
> +static int get_tdx_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
> +{
> +	int i, ret;
> +
> +	ret = stbuf_read_sysmd_single(MD_FIELD_ID_NUM_CMRS,
> +			&cmr_info->num_cmrs);
> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < cmr_info->num_cmrs; i++) {
> +		const struct field_mapping fields[] = {
> +			TD_SYSINFO_MAP_CMR_INFO(CMR_BASE0 + i, cmr_base[i]),
> +			TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE0 + i, cmr_size[i]),
> +		};
> +
> +		ret = stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields),
> +				cmr_info);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/*
> +	 * The TDX module may just report the maximum number of CMRs that
> +	 * TDX architecturally supports as the actual number of CMRs,
> +	 * despite the latter is smaller.  In this case all the tail
> +	 * CMRs will be empty.  Trim them away.
> +	 */
> +	trim_empty_tail_cmrs(cmr_info);
> +
> +	return 0;
> +}
> +
> +static void print_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
> +{
> +	int i;
> +
> +	for (i = 0; i < cmr_info->num_cmrs; i++) {
> +		u64 cmr_base = cmr_info->cmr_base[i];
> +		u64 cmr_size = cmr_info->cmr_size[i];
> +
> +		pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base,
> +				cmr_base + cmr_size);
> +	}
> +}

Do we really want to always print all CMR regions, won't that become way 
too spammy and isn't this really useful in debug scenarios? Perhaps gate 
this particular information behind a debug flag?

> +
>   static void print_basic_sysinfo(struct tdx_sysinfo *sysinfo)
>   {
>   	struct tdx_sysinfo_module_version *modver = &sysinfo->module_version;

<snip>
Huang, Kai June 19, 2024, 1:23 a.m. UTC | #4
On Tue, 2024-06-18 at 18:10 +0300, Nikolay Borisov wrote:
> 
> On 16.06.24 г. 15:01 ч., Kai Huang wrote:
> > A TDX module initialization failure was reported on a Emerald Rapids
> > platform:
> > 
> >    virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> >    virt/tdx: module initialization failed (-28)
> > 
> > As a step of initializing the TDX module, the kernel tells the TDX
> > module all the "TDX-usable memory regions" via a set of TDX architecture
> > defined structure "TD Memory Region" (TDMR).  Each TDMR must be in 1GB
> > aligned and in 1GB granularity, and all "non-TDX-usable memory holes" in
> > a given TDMR must be marked as a "reserved area".  Each TDMR only
> > supports a maximum number of reserved areas reported by the TDX module.
> > 
> > As shown above, the root cause of this failure is when the kernel tries
> > to construct a TDMR to cover address range [0x0, 0x80000000), there
> > are too many memory holes within that range and the number of memory
> > holes exceeds the maximum number of reserved areas.
> > 
> > The E820 table of that platform (see [1] below) reflects this: the
> > number of memory holes among e820 "usable" entries exceeds 16, which is
> > the maximum number of reserved areas TDX module supports in practice.
> > 
> > === Fix ===
> > 
> > There are two options to fix this: 1) put less memory holes as "reserved
> > area" when constructing a TDMR; 2) reduce the TDMR's size to cover less
> > memory regions, thus less memory holes.
> > 
> > Option 1) is possible, and in fact is easier and preferable:
> > 
> > TDX actually has a concept of "Convertible Memory Regions" (CMRs).  TDX
> > reports a list of CMRs that meet TDX's security requirements on memory.
> > TDX requires all the "TDX-usable memory regions" that the kernel passes
> > to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs",
> > must be convertible memory.
> > 
> > In other words, if a memory hole is indeed CMR, then it's not mandatory
> 
> So TDX requires all TDMR to be CMR, and CMR regions are reported by the 
> BIOS, how did you arrive at the conclusion that if a hole is CMR there 
> is no point in creating a TDMR for it?

TDX requires all "non-reserved areas" in one TDMR to be CMR.  TDMR is 1GB
aligned, and CMR is 4KB aligned.  From TDX's perspective, the kernel must
at least add those non-CMR holes as "reserved area".  However the kernel
_can_ adds more holes (in e820) as "reserved area", despite those holes
are CMR physically.

> 
> > for the kernel to add it to the reserved areas.  The number of consumed
> > reserved areas can be reduced if the kernel doesn't add those memory
> > holes as reserved area.  Note this doesn't have security impact because
> > the kernel is out of TDX's TCB anyway.
> > 
> > This is feasible because in practice the CMRs just reflect the nature of
> > whether the RAM can indeed be used by TDX, thus each CMR tends to be a
> > large range w/o being split into small areas, e.g., in the way the e820
> > table does to contain a lot "ACPI *" entries.  [2] below shows the CMRs
> > reported on the problematic platform (using the off-tree TDX code).
> > 
> > So for this particular module initialization failure, the memory holes
> > that are within [0x0, 0x80000000) are mostly indeed CMR.  By not adding
> > them to the reserved areas, the number of consumed reserved areas for
> > the TDMR [0x0, 0x80000000) can be dramatically reduced.
> > 
> > On the other hand, although option 2) is also theoretically feasible, it
> > requires more complicated logic to handle around splitting TDMR into
> > smaller ones.  E.g., today one memory region must be fully in one TDMR,
> > while splitting TDMR will result in each TDMR only covering part of some
> > memory region.  And this also increases the total number of TDMRs, which
> > also cannot exceed a maximum value that TDX module supports.
> > 
> 
> <snip>
> 
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >   arch/x86/virt/vmx/tdx/tdx.c | 149 ++++++++++++++++++++++++++++++++----
> >   arch/x86/virt/vmx/tdx/tdx.h |  13 ++++
> >   2 files changed, 146 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index ced40e3b516e..88a0c8b788b7 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -293,6 +293,10 @@ static int stbuf_read_sysmd_field(u64 field_id, void *stbuf, int offset,
> >   	return 0;
> >   }
> >   
> > +/* Wrapper to read one metadata field to u8/u16/u32/u64 */
> > +#define stbuf_read_sysmd_single(_field_id, _pdata)	\
> > +	stbuf_read_sysmd_field(_field_id, _pdata, 0, sizeof(typeof(*(_pdata))))
> 
> What value does adding yet another level of indirection bring here?

We could use the raw version instead: read_sys_metadata_field().

This wrapper additionally checks the 'element size' encoded in the field
ID matches the size that passed in, so it can catch potential kernel bug.

But I can remove this to simplify the code.

> 
> > +
> >   struct field_mapping {
> >   	u64 field_id;
> >   	int offset;
> > @@ -349,6 +353,76 @@ static int get_tdx_module_version(struct tdx_sysinfo_module_version *modver)
> >   	return stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields), modver);
> >   }
> >   
> > +/* Update the @cmr_info->num_cmrs to trim tail empty CMRs */
> > +static void trim_empty_tail_cmrs(struct tdx_sysinfo_cmr_info *cmr_info)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < cmr_info->num_cmrs; i++) {
> > +		u64 cmr_base = cmr_info->cmr_base[i];
> > +		u64 cmr_size = cmr_info->cmr_size[i];
> > +
> > +		if (!cmr_size) {
> > +			WARN_ON_ONCE(cmr_base);
> > +			break;
> > +		}
> > +
> > +		/* TDX architecture: CMR must be 4KB aligned */
> > +		WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) ||
> > +				!PAGE_ALIGNED(cmr_size));
> > +	}
> > +
> > +	cmr_info->num_cmrs = i;
> > +}
> 
> That function is somewhat weird, on the one hand its name suggests it's 
> doing some "optimisation" i.e removing empty cmrs, at the same time it 
> will simply cap the number of CMRs until it meets the first empty CMR, 

(sorry I found it's hard to read the text above, but I am trying to reply)

I wouldn't say it is "optimization".  We just want to exclude the empty
CMRs so that we don't need to worry about this case when we need to loop
over all CMRs.

> what aif we have and will also WARN. In fact it could even crash the 
> machine if panic_on_warn is enabled, 
> 

I don't follow why we "will" WARN()?  If the BIOS is sane, then this code
will never WARN().  

In fact, if there are some error in the reported CMRs, the TDX is likely
to be broken completely, and we cannot trust the machine to be in working
state.  So if any WARN() actually happens, to me it's OK to panic the
kernel.


> furthermore the alignement checks 
> suggest it's actually some sanity checking function. Furthermore if we 
> have:"
> 
> ORDINARY_CMR,EMPTY_CMR,ORDINARY_CMR
> 
> (Is such a scenario even possible), in this case we'll ommit also the 
> last ordinary cmr region?

It cannot happen.

The fact is:

1) CMR base/size are 4KB aligned.  This is architectural behaviour.
2) TDX architecturally supports 32 CMRs maximumly;
3) In practice, TDX can just report the 'NUM_CMRS' metadata field as 32,
but there can be empty/null CMRs following valid CMRs.
4) A empty/null CMR between valid CMRs cannot happen.

So the kernel at least needs to skip/trim those empty/null tail CMRs. 
This can be done in a function like above, or we manually check cmr's
base/size being 0 when we loop over CMRs.

Having a function to trim the tail empty CMRs is way more straightforward.

In terms of sanity check, the question is what kind level of sanity check
we want the kernel to do.  To confess, the WARN()s that were added in this
patch is basically because I found them are straightforward to add,
because I don't want to add a lot of sanity check just to guard something
cannot happen in practice.

How about we just remove those WARN()s?

> 
> > +
> > +#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member)	\
> > +	TD_SYSINFO_MAP(_field_id, struct tdx_sysinfo_cmr_info, _member)
> 
> nit: Again, no real value in introducing yet another level of 
> indirection in this case.

See above.

> 
> > +
> > +static int get_tdx_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
> > +{
> > +	int i, ret;
> > +
> > +	ret = stbuf_read_sysmd_single(MD_FIELD_ID_NUM_CMRS,
> > +			&cmr_info->num_cmrs);
> > +	if (ret)
> > +		return ret;
> > +
> > +	for (i = 0; i < cmr_info->num_cmrs; i++) {
> > +		const struct field_mapping fields[] = {
> > +			TD_SYSINFO_MAP_CMR_INFO(CMR_BASE0 + i, cmr_base[i]),
> > +			TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE0 + i, cmr_size[i]),
> > +		};
> > +
> > +		ret = stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields),
> > +				cmr_info);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	/*
> > +	 * The TDX module may just report the maximum number of CMRs that
> > +	 * TDX architecturally supports as the actual number of CMRs,
> > +	 * despite the latter is smaller.  In this case all the tail
> > +	 * CMRs will be empty.  Trim them away.
> > +	 */
> > +	trim_empty_tail_cmrs(cmr_info);
> > +
> > +	return 0;
> > +}
> > +
> > +static void print_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < cmr_info->num_cmrs; i++) {
> > +		u64 cmr_base = cmr_info->cmr_base[i];
> > +		u64 cmr_size = cmr_info->cmr_size[i];
> > +
> > +		pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base,
> > +				cmr_base + cmr_size);
> > +	}
> > +}
> 
> Do we really want to always print all CMR regions, won't that become way 
> too spammy and isn't this really useful in debug scenarios? Perhaps gate 
> this particular information behind a debug flag?

It is helpful when something goes wrong, and that could happen for non-
debug kernels.  I think we should just print it like e820 table, for which
the kernel uses pr_info().
Huang, Kai June 19, 2024, 4:58 a.m. UTC | #5
> > > +/* Wrapper to read one metadata field to u8/u16/u32/u64 */
> > > +#define stbuf_read_sysmd_single(_field_id, _pdata)	\
> > > +	stbuf_read_sysmd_field(_field_id, _pdata, 0, sizeof(typeof(*(_pdata))))
> > 
> > What value does adding yet another level of indirection bring here?
> 
> We could use the raw version instead: read_sys_metadata_field().
> 
> This wrapper additionally checks the 'element size' encoded in the field
> ID matches the size that passed in, so it can catch potential kernel bug.
> 
> But I can remove this to simplify the code.
> 

Sorry I didn't finish my reply properly as I was interrupted.

Another advantage of this wrapper is, as mentioned in the comment of it,
it works with 'u8/u16/u32/u64' directly, while the raw
read_sys_metadata_field() always puts data to a 'u64'.

An example is, ...

> > > +
> > > +	ret = stbuf_read_sysmd_single(MD_FIELD_ID_NUM_CMRS,
> > > +			&cmr_info->num_cmrs);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > 

... here to read cmr_info->num_cmrs, which is a 'u16'.

So it is helpful to me.
Binbin Wu July 5, 2024, 3 a.m. UTC | #6
On 6/19/2024 9:23 AM, Huang, Kai wrote:
[...]
>
>> furthermore the alignement checks
>> suggest it's actually some sanity checking function. Furthermore if we
>> have:"
>>
>> ORDINARY_CMR,EMPTY_CMR,ORDINARY_CMR
>>
>> (Is such a scenario even possible), in this case we'll ommit also the
>> last ordinary cmr region?
> It cannot happen.
>
> The fact is:
>
> 1) CMR base/size are 4KB aligned.  This is architectural behaviour.
> 2) TDX architecturally supports 32 CMRs maximumly;
Do you think it's worth a comment to the definition of TDX_MAX_CMRS that 
the number is architectural?

> 3) In practice, TDX can just report the 'NUM_CMRS' metadata field as 32,
> but there can be empty/null CMRs following valid CMRs.
> 4) A empty/null CMR between valid CMRs cannot happen.
>
>
Huang, Kai July 5, 2024, 9:36 a.m. UTC | #7
On Fri, 2024-07-05 at 11:00 +0800, Binbin Wu wrote:
> 
> On 6/19/2024 9:23 AM, Huang, Kai wrote:
> [...]
> > 
> > > furthermore the alignement checks
> > > suggest it's actually some sanity checking function. Furthermore if we
> > > have:"
> > > 
> > > ORDINARY_CMR,EMPTY_CMR,ORDINARY_CMR
> > > 
> > > (Is such a scenario even possible), in this case we'll ommit also the
> > > last ordinary cmr region?
> > It cannot happen.
> > 
> > The fact is:
> > 
> > 1) CMR base/size are 4KB aligned.  This is architectural behaviour.
> > 2) TDX architecturally supports 32 CMRs maximumly;
> Do you think it's worth a comment to the definition of TDX_MAX_CMRS that 
> the number is architectural?
> 

OK will do. thanks.
diff mbox series

Patch

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ced40e3b516e..88a0c8b788b7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -293,6 +293,10 @@  static int stbuf_read_sysmd_field(u64 field_id, void *stbuf, int offset,
 	return 0;
 }
 
+/* Wrapper to read one metadata field to u8/u16/u32/u64 */
+#define stbuf_read_sysmd_single(_field_id, _pdata)	\
+	stbuf_read_sysmd_field(_field_id, _pdata, 0, sizeof(typeof(*(_pdata))))
+
 struct field_mapping {
 	u64 field_id;
 	int offset;
@@ -349,6 +353,76 @@  static int get_tdx_module_version(struct tdx_sysinfo_module_version *modver)
 	return stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields), modver);
 }
 
+/* Update the @cmr_info->num_cmrs to trim tail empty CMRs */
+static void trim_empty_tail_cmrs(struct tdx_sysinfo_cmr_info *cmr_info)
+{
+	int i;
+
+	for (i = 0; i < cmr_info->num_cmrs; i++) {
+		u64 cmr_base = cmr_info->cmr_base[i];
+		u64 cmr_size = cmr_info->cmr_size[i];
+
+		if (!cmr_size) {
+			WARN_ON_ONCE(cmr_base);
+			break;
+		}
+
+		/* TDX architecture: CMR must be 4KB aligned */
+		WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) ||
+				!PAGE_ALIGNED(cmr_size));
+	}
+
+	cmr_info->num_cmrs = i;
+}
+
+#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member)	\
+	TD_SYSINFO_MAP(_field_id, struct tdx_sysinfo_cmr_info, _member)
+
+static int get_tdx_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
+{
+	int i, ret;
+
+	ret = stbuf_read_sysmd_single(MD_FIELD_ID_NUM_CMRS,
+			&cmr_info->num_cmrs);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < cmr_info->num_cmrs; i++) {
+		const struct field_mapping fields[] = {
+			TD_SYSINFO_MAP_CMR_INFO(CMR_BASE0 + i, cmr_base[i]),
+			TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE0 + i, cmr_size[i]),
+		};
+
+		ret = stbuf_read_sysmd_multi(fields, ARRAY_SIZE(fields),
+				cmr_info);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * The TDX module may just report the maximum number of CMRs that
+	 * TDX architecturally supports as the actual number of CMRs,
+	 * despite the latter is smaller.  In this case all the tail
+	 * CMRs will be empty.  Trim them away.
+	 */
+	trim_empty_tail_cmrs(cmr_info);
+
+	return 0;
+}
+
+static void print_cmr_info(struct tdx_sysinfo_cmr_info *cmr_info)
+{
+	int i;
+
+	for (i = 0; i < cmr_info->num_cmrs; i++) {
+		u64 cmr_base = cmr_info->cmr_base[i];
+		u64 cmr_size = cmr_info->cmr_size[i];
+
+		pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base,
+				cmr_base + cmr_size);
+	}
+}
+
 static void print_basic_sysinfo(struct tdx_sysinfo *sysinfo)
 {
 	struct tdx_sysinfo_module_version *modver = &sysinfo->module_version;
@@ -372,6 +446,8 @@  static void print_basic_sysinfo(struct tdx_sysinfo *sysinfo)
 			modver->major,		modver->minor,
 			modver->update,		modver->internal,
 			modver->build_num,	modver->build_date);
+
+	print_cmr_info(&sysinfo->cmr_info);
 }
 
 #define TD_SYSINFO_MAP_TDMR_INFO(_field_id, _member)	\
@@ -404,6 +480,10 @@  static int get_tdx_sysinfo(struct tdx_sysinfo *sysinfo)
 	if (ret)
 		return ret;
 
+	ret = get_tdx_cmr_info(&sysinfo->cmr_info);
+	if (ret)
+		return ret;
+
 	return get_tdx_tdmr_sysinfo(&sysinfo->tdmr_info);
 }
 
@@ -827,6 +907,23 @@  static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
 	return 0;
 }
 
+/* Return whether a given region [start, end) is a sub-region of any CMR */
+static bool is_cmr_subregion(struct tdx_sysinfo_cmr_info *cmr_info, u64 start,
+			    u64 end)
+{
+	int i;
+
+	for (i = 0; i < cmr_info->num_cmrs; i++) {
+		u64 cmr_base = cmr_info->cmr_base[i];
+		u64 cmr_size = cmr_info->cmr_size[i];
+
+		if (start >= cmr_base && end <= (cmr_base + cmr_size))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Go through @tmb_list to find holes between memory areas.  If any of
  * those holes fall within @tdmr, set up a TDMR reserved area to cover
@@ -835,7 +932,8 @@  static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
 static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
 				    struct tdmr_info *tdmr,
 				    int *rsvd_idx,
-				    u16 max_reserved_per_tdmr)
+				    u16 max_reserved_per_tdmr,
+				    struct tdx_sysinfo_cmr_info *cmr_info)
 {
 	struct tdx_memblock *tmb;
 	u64 prev_end;
@@ -864,10 +962,16 @@  static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
 		 * Skip over memory areas that
 		 * have already been dealt with.
 		 */
-		if (start <= prev_end) {
-			prev_end = end;
-			continue;
-		}
+		if (start <= prev_end)
+			goto next_tmb;
+
+		/*
+		 * Found the hole [prev_end, start) before this region.
+		 * Skip the hole if it is within any CMR to reduce the
+		 * consumption of reserved areas.
+		 */
+		if (is_cmr_subregion(cmr_info, prev_end, start))
+			goto next_tmb;
 
 		/* Add the hole before this region */
 		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
@@ -876,11 +980,16 @@  static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
 		if (ret)
 			return ret;
 
+next_tmb:
 		prev_end = end;
 	}
 
-	/* Add the hole after the last region if it exists. */
-	if (prev_end < tdmr_end(tdmr)) {
+	/*
+	 * Add the hole after the last region if it exists, but skip
+	 * if it is within any CMR.
+	 */
+	if (prev_end < tdmr_end(tdmr) &&
+			!is_cmr_subregion(cmr_info, prev_end, tdmr_end(tdmr))) {
 		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
 				tdmr_end(tdmr) - prev_end,
 				max_reserved_per_tdmr);
@@ -956,12 +1065,13 @@  static int rsvd_area_cmp_func(const void *a, const void *b)
 static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
 				    struct list_head *tmb_list,
 				    struct tdmr_info_list *tdmr_list,
-				    u16 max_reserved_per_tdmr)
+				    u16 max_reserved_per_tdmr,
+				    struct tdx_sysinfo_cmr_info *cmr_info)
 {
 	int ret, rsvd_idx = 0;
 
 	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
-			max_reserved_per_tdmr);
+			max_reserved_per_tdmr, cmr_info);
 	if (ret)
 		return ret;
 
@@ -979,11 +1089,13 @@  static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
 
 /*
  * Populate reserved areas for all TDMRs in @tdmr_list, including memory
- * holes (via @tmb_list) and PAMTs.
+ * holes (via @tmb_list) and PAMTs.  Exclude the memory holes within any
+ * CMR to reduce number of consumed reserved areas.
  */
 static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 					 struct list_head *tmb_list,
-					 u16 max_reserved_per_tdmr)
+					 u16 max_reserved_per_tdmr,
+					 struct tdx_sysinfo_cmr_info *cmr_info)
 {
 	int i;
 
@@ -991,7 +1103,8 @@  static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 		int ret;
 
 		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
-				tmb_list, tdmr_list, max_reserved_per_tdmr);
+				tmb_list, tdmr_list, max_reserved_per_tdmr,
+				cmr_info);
 		if (ret)
 			return ret;
 	}
@@ -1002,11 +1115,13 @@  static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
- * TDMR global information in @tdmr_sysinfo.
+ * TDMR global information in @tdmr_sysinfo and CMR information in
+ * @cmr_info.
  */
 static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
-			   struct tdx_sysinfo_tdmr_info *tdmr_sysinfo)
+			   struct tdx_sysinfo_tdmr_info *tdmr_sysinfo,
+			   struct tdx_sysinfo_cmr_info *cmr_info)
 {
 	int ret;
 
@@ -1020,7 +1135,8 @@  static int construct_tdmrs(struct list_head *tmb_list,
 		return ret;
 
 	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
-			tdmr_sysinfo->max_reserved_per_tdmr);
+			tdmr_sysinfo->max_reserved_per_tdmr,
+			cmr_info);
 	if (ret)
 		tdmrs_free_pamt_all(tdmr_list);
 
@@ -1210,7 +1326,8 @@  static int init_tdx_module(void)
 		goto err_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr_info);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr_info,
+			&sysinfo.cmr_info);
 	if (ret)
 		goto err_free_tdmrs;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d80ec797fbf1..be93b6f31e5b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -40,6 +40,10 @@ 
 #define MD_FIELD_ID_UPDATE_VERSION		0x0800000100000005ULL
 #define MD_FIELD_ID_INTERNAL_VERSION		0x0800000100000006ULL
 
+#define MD_FIELD_ID_NUM_CMRS			0x9000000100000000ULL
+#define MD_FIELD_ID_CMR_BASE0			0x9000000300000080ULL
+#define MD_FIELD_ID_CMR_SIZE0			0x9000000300000100ULL
+
 #define MD_FIELD_ID_MAX_TDMRS			0x9100000100000008ULL
 #define MD_FIELD_ID_MAX_RESERVED_PER_TDMR	0x9100000100000009ULL
 #define MD_FIELD_ID_PAMT_4K_ENTRY_SIZE		0x9100000100000010ULL
@@ -155,6 +159,14 @@  struct tdx_sysinfo_module_version {
 	u32 build_date;
 };
 
+/* Class "CMR Info" */
+#define TDX_MAX_CMRS	32
+struct tdx_sysinfo_cmr_info {
+	u16 num_cmrs;
+	u64 cmr_base[TDX_MAX_CMRS];
+	u64 cmr_size[TDX_MAX_CMRS];
+};
+
 /* Class "TDMR Info" */
 struct tdx_sysinfo_tdmr_info {
 	u16 max_tdmrs;
@@ -165,6 +177,7 @@  struct tdx_sysinfo_tdmr_info {
 struct tdx_sysinfo {
 	struct tdx_sysinfo_module_info		module_info;
 	struct tdx_sysinfo_module_version	module_version;
+	struct tdx_sysinfo_cmr_info		cmr_info;
 	struct tdx_sysinfo_tdmr_info		tdmr_info;
 };