
[1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported

Message ID 20231218140543.870234-2-tao1.su@linux.intel.com (mailing list archive)
State New, archived
Series x86: KVM: Limit guest physical bits when 5-level EPT is unsupported

Commit Message

Tao Su Dec. 18, 2023, 2:05 p.m. UTC
When the host doesn't support 5-level EPT, bits 51:48 of the guest physical
address must all be zero, otherwise an EPT violation always occurs and the
current handler can't resolve it if the GPA is in a RAM region. Hence, the
faulting instruction keeps being re-executed, causing an endless stream of
EPT violations.

Six KVM selftests time out due to this issue:
    kvm:access_tracking_perf_test
    kvm:demand_paging_test
    kvm:dirty_log_test
    kvm:dirty_log_perf_test
    kvm:kvm_page_table_test
    kvm:memslot_modification_stress_test

The above selftests add a RAM region close to max_gfn; if the host has 52
physical bits but doesn't support 5-level EPT, they trigger endless EPT
violations when accessing that RAM region.

Since current Intel CPUID doesn't report the max guest physical bits the way
AMD does, introduce kvm_mmu_tdp_maxphyaddr() to limit the guest physical bits
when TDP is enabled and report a max guest physical bits value smaller than
the host's.

When the guest physical bits value is smaller than the host's, some GPAs are
illegal from the guest's perspective but still legal from the hardware's
perspective, and accesses to them should be trapped so that a #PF can be
injected. KVM already has a parameter, allow_smaller_maxphyaddr, to support
the guest.MAXPHYADDR < host.MAXPHYADDR case; it is disabled by default when
EPT is enabled, and the user can enable it when loading the kvm-intel module.
When allow_smaller_maxphyaddr is enabled and the guest accesses an address
that is illegal from its own perspective, KVM uses the resulting EPT
violation to emulate the instruction, inject the #PF and determine the #PF
error code.

Reported-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
---
 arch/x86/kvm/cpuid.c   | 5 +++--
 arch/x86/kvm/mmu.h     | 1 +
 arch/x86/kvm/mmu/mmu.c | 7 +++++++
 3 files changed, 11 insertions(+), 2 deletions(-)
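For context (illustrative only, not part of the patch; the helper names below
are hypothetical): the constraint under discussion is which GPA bits an
N-level EPT walk can actually translate, i.e. bits 47:0 for 4-level EPT
versus the full architectural 52-bit range for 5-level EPT.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static unsigned int ept_translatable_gpa_bits(int ept_walk_levels)
{
	/*
	 * A 4-level EPT walk consumes GPA bits 47:0; a 5-level walk can
	 * consume bits 56:0, but guest-physical addresses are
	 * architecturally capped at 52 bits.
	 */
	return ept_walk_levels == 5 ? 52 : 48;
}

static bool gpa_is_translatable(uint64_t gpa, int ept_walk_levels)
{
	return (gpa >> ept_translatable_gpa_bits(ept_walk_levels)) == 0;
}

int main(void)
{
	/* A GPA near the top of a 52-bit MAXPHYADDR, as the selftests use. */
	uint64_t gpa = (1ull << 52) - 0x10000;

	printf("4-level EPT can map it: %d\n", gpa_is_translatable(gpa, 4));
	printf("5-level EPT can map it: %d\n", gpa_is_translatable(gpa, 5));
	return 0;
}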

Comments

Sean Christopherson Dec. 18, 2023, 3:13 p.m. UTC | #1
On Mon, Dec 18, 2023, Tao Su wrote:
> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> address must all be zero, otherwise an EPT violation always occurs and
> current handler can't resolve this if the gpa is in RAM region. Hence,
> instruction will keep being executed repeatedly, which causes infinite
> EPT violation.
> 
> Six KVM selftests are timeout due to this issue:
>     kvm:access_tracking_perf_test
>     kvm:demand_paging_test
>     kvm:dirty_log_test
>     kvm:dirty_log_perf_test
>     kvm:kvm_page_table_test
>     kvm:memslot_modification_stress_test
> 
> The above selftests add a RAM region close to max_gfn, if host has 52
> physical bits but doesn't support 5-level EPT, these will trigger infinite
> EPT violation when access the RAM region.
> 
> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> enabled and report the max guest physical bits which is smaller than host.
> 
> When guest physical bits is smaller than host, some GPA are illegal from
> guest's perspective, but are still legal from hardware's perspective,
> which should be trapped to inject #PF. Current KVM already has a parameter
> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> is enabled and guest accesses an illegal address from guest's perspective,
> KVM will utilize EPT violation and emulate the instruction to inject #PF
> and determine #PF error code.

No, fix the selftests, it's not KVM's responsibility to advertise the correct
guest.MAXPHYADDR.
Chao Gao Dec. 19, 2023, 2:51 a.m. UTC | #2
On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
>On Mon, Dec 18, 2023, Tao Su wrote:
>> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
>> address must all be zero, otherwise an EPT violation always occurs and
>> current handler can't resolve this if the gpa is in RAM region. Hence,
>> instruction will keep being executed repeatedly, which causes infinite
>> EPT violation.
>> 
>> Six KVM selftests are timeout due to this issue:
>>     kvm:access_tracking_perf_test
>>     kvm:demand_paging_test
>>     kvm:dirty_log_test
>>     kvm:dirty_log_perf_test
>>     kvm:kvm_page_table_test
>>     kvm:memslot_modification_stress_test
>> 
>> The above selftests add a RAM region close to max_gfn, if host has 52
>> physical bits but doesn't support 5-level EPT, these will trigger infinite
>> EPT violation when access the RAM region.
>> 
>> Since current Intel CPUID doesn't report max guest physical bits like AMD,
>> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
>> enabled and report the max guest physical bits which is smaller than host.
>> 
>> When guest physical bits is smaller than host, some GPA are illegal from
>> guest's perspective, but are still legal from hardware's perspective,
>> which should be trapped to inject #PF. Current KVM already has a parameter
>> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
>> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
>> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
>> is enabled and guest accesses an illegal address from guest's perspective,
>> KVM will utilize EPT violation and emulate the instruction to inject #PF
>> and determine #PF error code.
>
>No, fix the selftests, it's not KVM's responsibility to advertise the correct
>guest.MAXPHYADDR.

In this case, host.MAXPHYADDR is 52 and EPT supports 4-level only, so it can
translate at most 48 bits of GPA.

Here, nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
is invalid/incorrect. How can we say the selftests are at fault and that we
should fix them?
Jim Mattson Dec. 19, 2023, 3:40 a.m. UTC | #3
On Mon, Dec 18, 2023 at 6:51 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
> >On Mon, Dec 18, 2023, Tao Su wrote:
> >> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> >> address must all be zero, otherwise an EPT violation always occurs and
> >> current handler can't resolve this if the gpa is in RAM region. Hence,
> >> instruction will keep being executed repeatedly, which causes infinite
> >> EPT violation.
> >>
> >> Six KVM selftests are timeout due to this issue:
> >>     kvm:access_tracking_perf_test
> >>     kvm:demand_paging_test
> >>     kvm:dirty_log_test
> >>     kvm:dirty_log_perf_test
> >>     kvm:kvm_page_table_test
> >>     kvm:memslot_modification_stress_test
> >>
> >> The above selftests add a RAM region close to max_gfn, if host has 52
> >> physical bits but doesn't support 5-level EPT, these will trigger infinite
> >> EPT violation when access the RAM region.
> >>
> >> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> >> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> >> enabled and report the max guest physical bits which is smaller than host.
> >>
> >> When guest physical bits is smaller than host, some GPA are illegal from
> >> guest's perspective, but are still legal from hardware's perspective,
> >> which should be trapped to inject #PF. Current KVM already has a parameter
> >> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> >> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> >> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> >> is enabled and guest accesses an illegal address from guest's perspective,
> >> KVM will utilize EPT violation and emulate the instruction to inject #PF
> >> and determine #PF error code.
> >
> >No, fix the selftests, it's not KVM's responsibility to advertise the correct
> >guest.MAXPHYADDR.
>
> In this case, host.MAXPHYADDR is 52 and EPT supports 4-level only thus can
> translate up to 48 bits of GPA.

There are a number of issues that KVM does not handle when
guest.MAXPHYADDR < host.MAXPHYADDR. For starters, KVM doesn't raise a
#GP in PAE mode when one of the address bits above guest.MAXPHYADDR is
set in one of the PDPTRs.

Honestly, I think KVM should just disable EPT if the EPT tables can't
support the CPU's physical address width.

> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> is invalid/incorrect. how can we say selftests are at fault and we should fix
> them?

In this case, the CPU is at fault, and you should complain to the CPU vendor.
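A rough sketch of what the "just disable EPT" option could look like, assuming
names along the lines of KVM's existing enable_ept and
cpu_has_vmx_ept_5levels(); this exact check is hypothetical and not part of
the series:

/*
 * Illustrative sketch only: disable EPT when the host MAXPHYADDR exceeds
 * what the supported EPT page-walk length can map.  enable_ept and
 * cpu_has_vmx_ept_5levels() mirror existing KVM symbols, but this check
 * itself is hypothetical.
 */
static void vmx_check_ept_vs_maxphyaddr(void)
{
	if (enable_ept && !cpu_has_vmx_ept_5levels() &&
	    boot_cpu_data.x86_phys_bits > 48) {
		pr_warn("kvm: host MAXPHYADDR %u exceeds 4-level EPT reach, disabling EPT\n",
			boot_cpu_data.x86_phys_bits);
		enable_ept = 0;
	}
}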
Chao Gao Dec. 19, 2023, 8:09 a.m. UTC | #4
On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
>On Mon, Dec 18, 2023 at 6:51 PM Chao Gao <chao.gao@intel.com> wrote:
>>
>> On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
>> >On Mon, Dec 18, 2023, Tao Su wrote:
>> >> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
>> >> address must all be zero, otherwise an EPT violation always occurs and
>> >> current handler can't resolve this if the gpa is in RAM region. Hence,
>> >> instruction will keep being executed repeatedly, which causes infinite
>> >> EPT violation.
>> >>
>> >> Six KVM selftests are timeout due to this issue:
>> >>     kvm:access_tracking_perf_test
>> >>     kvm:demand_paging_test
>> >>     kvm:dirty_log_test
>> >>     kvm:dirty_log_perf_test
>> >>     kvm:kvm_page_table_test
>> >>     kvm:memslot_modification_stress_test
>> >>
>> >> The above selftests add a RAM region close to max_gfn, if host has 52
>> >> physical bits but doesn't support 5-level EPT, these will trigger infinite
>> >> EPT violation when access the RAM region.
>> >>
>> >> Since current Intel CPUID doesn't report max guest physical bits like AMD,
>> >> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
>> >> enabled and report the max guest physical bits which is smaller than host.
>> >>
>> >> When guest physical bits is smaller than host, some GPA are illegal from
>> >> guest's perspective, but are still legal from hardware's perspective,
>> >> which should be trapped to inject #PF. Current KVM already has a parameter
>> >> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
>> >> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
>> >> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
>> >> is enabled and guest accesses an illegal address from guest's perspective,
>> >> KVM will utilize EPT violation and emulate the instruction to inject #PF
>> >> and determine #PF error code.
>> >
>> >No, fix the selftests, it's not KVM's responsibility to advertise the correct
>> >guest.MAXPHYADDR.
>>
>> In this case, host.MAXPHYADDR is 52 and EPT supports 4-level only thus can
>> translate up to 48 bits of GPA.
>
>There are a number of issues that KVM does not handle when
>guest.MAXPHYADDR < host.MAXPHYADDR. For starters, KVM doesn't raise a
>#GP in PAE mode when one of the address bits above guest.MAXPHYADDR is
>set in one of the PDPTRs.

These are long-standing issues, I believe.

Note: the current KVM ABI doesn't enforce guest.MAXPHYADDR = host.MAXPHYADDR,
regardless of "allow_smaller_maxphyaddr".

>
>Honestly, I think KVM should just disable EPT if the EPT tables can't
>support the CPU's physical address width.

Yes, it is an option.
But I would prefer to allow the admin to override this (i.e., the admin can
still enable EPT via the module parameter), because those issues are not new
and disabling EPT doesn't prevent QEMU from launching guests w/ a smaller
MAXPHYADDR.

>
>> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
>> is invalid/incorrect. how can we say selftests are at fault and we should fix
>> them?
>
>In this case, the CPU is at fault, and you should complain to the CPU vendor.

Yeah, I agree with you and will check with the related team inside Intel. My
point was just that this isn't a selftest issue, because not all of the
information is disclosed to the tests.

And I am afraid KVM as an L1 VMM may run into this situation, i.e., only
4-level EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
Tao Su Dec. 19, 2023, 8:31 a.m. UTC | #5
On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> > 
> > Six KVM selftests are timeout due to this issue:
> >     kvm:access_tracking_perf_test
> >     kvm:demand_paging_test
> >     kvm:dirty_log_test
> >     kvm:dirty_log_perf_test
> >     kvm:kvm_page_table_test
> >     kvm:memslot_modification_stress_test
> > 
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> > 
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> > 
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
> 
> No, fix the selftests, it's not KVM's responsibility to advertise the correct
> guest.MAXPHYADDR.

This patch is not for fixing these selftests; it is for fixing the issue
exposed by the selftests. KVM has the responsibility to report a correct
guest.MAXPHYADDR, which lets userspace set valid physical bits for KVM.

Actually, KVM will get stuck in a loop if it tries to build an SPTE with any
of bits[51:48] set, e.g., when a huge amount of RAM is assigned to the guest.

Thanks,
Tao
Sean Christopherson Dec. 19, 2023, 3:26 p.m. UTC | #6
On Tue, Dec 19, 2023, Chao Gao wrote:
> On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> >Honestly, I think KVM should just disable EPT if the EPT tables can't
> >support the CPU's physical address width.
> 
> Yes, it is an option.
> But I prefer to allow admin to override this (i.e., admin still can enable EPT
> via module parameter) because those issues are not new and disabling EPT
> doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
> 
> >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> >> them?
> >
> >In this case, the CPU is at fault, and you should complain to the CPU vendor.
> 
> Yeah, I agree with you and will check with related team inside Intel.

I agree that the CPU is being weird, but this is technically an architecturally
legal configuration, and KVM has largely committed to supporting weird setups.
At some point we have to draw a line when things get too ridiculous, but I don't
think this particular oddity crosses into absurd territory.

> My point was just this isn't a selftest issue because not all information is
> disclosed to the tests.

Ah, right, EPT capabilities are in MSRs that userspace can't read.

> And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.

Yes, but forcing emulation for a funky setup is not a good fix.  KVM can simply
constrain the advertised MAXPHYADDR, no? 

---
 arch/x86/kvm/cpuid.c   | 17 +++++++++++++----
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/mmu/mmu.c |  5 +++++
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 294e5bd5f8a0..5c346e1a10bd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		 *
 		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
 		 * provided, use the raw bare metal MAXPHYADDR as reductions to
-		 * the HPAs do not affect GPAs.
+		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
+		 * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
+		 * bits as KVM can't install SPTEs for larger GPAs.
 		 */
-		if (!tdp_enabled)
+		if (!tdp_enabled) {
 			g_phys_as = boot_cpu_data.x86_phys_bits;
-		else if (!g_phys_as)
-			g_phys_as = phys_as;
+		} else {
+			u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
+
+			if (!g_phys_as)
+				g_phys_as = phys_as;
+
+			if (max_tdp_level < 5)
+				g_phys_as = min(g_phys_as, 48);
+		}
 
 		entry->eax = g_phys_as | (virt_as << 8);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..b410a227c601 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 	return boot_cpu_data.x86_phys_bits;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void);
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..b2845f5520b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 	return max_tdp_level;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void)
+{
+	return tdp_root_level ? tdp_root_level : max_tdp_level;
+}
+
 static union kvm_mmu_page_role
 kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 				union kvm_cpu_role cpu_role)

base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
--
Xiaoyao Li Dec. 20, 2023, 7:16 a.m. UTC | #7
On 12/19/2023 11:26 PM, Sean Christopherson wrote:
> KVM can simply
> constrain the advertised MAXPHYADDR, no?

Sean, it looks like you agree with this patch (patch 1) now.

I think it's better for you to comment on the original code from Tao
instead of throwing out your own version. Tao needs to know why his
version is not OK/correct and what can be improved.

Thanks,
-Xiaoyao

> ---
>   arch/x86/kvm/cpuid.c   | 17 +++++++++++++----
>   arch/x86/kvm/mmu.h     |  2 ++
>   arch/x86/kvm/mmu/mmu.c |  5 +++++
>   3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..5c346e1a10bd 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   		 *
>   		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
>   		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
> +		 * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
> +		 * bits as KVM can't install SPTEs for larger GPAs.
>   		 */
> -		if (!tdp_enabled)
> +		if (!tdp_enabled) {
>   			g_phys_as = boot_cpu_data.x86_phys_bits;
> -		else if (!g_phys_as)
> -			g_phys_as = phys_as;
> +		} else {
> +			u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
> +
> +			if (!g_phys_as)
> +				g_phys_as = phys_as;
> +
> +			if (max_tdp_level < 5)
> +				g_phys_as = min(g_phys_as, 48);
> +		}
>   
>   		entry->eax = g_phys_as | (virt_as << 8);
>   		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>   	return boot_cpu_data.x86_phys_bits;
>   }
>   
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
>   void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>   void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..b2845f5520b3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>   	return max_tdp_level;
>   }
>   
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> +	return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
>   static union kvm_mmu_page_role
>   kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>   				union kvm_cpu_role cpu_role)
> 
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
Tao Su Dec. 20, 2023, 11:59 a.m. UTC | #8
On Tue, Dec 19, 2023 at 07:26:00AM -0800, Sean Christopherson wrote:
> On Tue, Dec 19, 2023, Chao Gao wrote:
> > On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> > >Honestly, I think KVM should just disable EPT if the EPT tables can't
> > >support the CPU's physical address width.
> > 
> > Yes, it is an option.
> > But I prefer to allow admin to override this (i.e., admin still can enable EPT
> > via module parameter) because those issues are not new and disabling EPT
> > doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
> > 
> > >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> > >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> > >> them?
> > >
> > >In this case, the CPU is at fault, and you should complain to the CPU vendor.
> > 
> > Yeah, I agree with you and will check with related team inside Intel.
> 
> I agree that the CPU is being weird, but this is technically an architecturally
> legal configuration, and KVM has largely committed to supporting weird setups.
> At some point we have to draw a line when things get too ridiculous, but I don't
> think this particular oddity crosses into absurd territory.
> 
> > My point was just this isn't a selftest issue because not all information is
> > disclosed to the tests.
> 
> Ah, right, EPT capabilities are in MSRs that userspace can't read.
> 
> > And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> > EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
> 
> Yes, but forcing emulation for a funky setup is not a good fix.  KVM can simply
> constrain the advertised MAXPHYADDR, no? 

The GPA is controlled by the guest, i.e., the guest can simply install a PTE
in its own page tables, and GPAs beyond 48 bits will always trigger EPT
violations. If KVM does nothing, the guest can't get a #PF when accessing an
address above MAXPHYADDR, which is inconsistent with architectural behavior.
But doing nothing is also an option, because userspace doesn't necessarily
respect the reported value.

Thanks,
Tao
Jim Mattson Dec. 20, 2023, 1:39 p.m. UTC | #9
On Tue, Dec 19, 2023 at 7:26 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 19, 2023, Chao Gao wrote:
> > On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> > >Honestly, I think KVM should just disable EPT if the EPT tables can't
> > >support the CPU's physical address width.
> >
> > Yes, it is an option.
> > But I prefer to allow admin to override this (i.e., admin still can enable EPT
> > via module parameter) because those issues are not new and disabling EPT
> > doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
> >
> > >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> > >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> > >> them?
> > >
> > >In this case, the CPU is at fault, and you should complain to the CPU vendor.
> >
> > Yeah, I agree with you and will check with related team inside Intel.
>
> I agree that the CPU is being weird, but this is technically an architecturally
> legal configuration, and KVM has largely committed to supporting weird setups.
> At some point we have to draw a line when things get too ridiculous, but I don't
> think this particular oddity crosses into absurd territory.
>
> > My point was just this isn't a selftest issue because not all information is
> > disclosed to the tests.
>
> Ah, right, EPT capabilities are in MSRs that userspace can't read.
>
> > And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> > EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
>
> Yes, but forcing emulation for a funky setup is not a good fix.  KVM can simply
> constrain the advertised MAXPHYADDR, no?

This is essentially the same as allow_smaller_maxphyaddr, and has the
same issues:
1) MOV-to-CR3 must be intercepted in PAE mode, to check the bits above
guest.MAXPHYADDR in the PDPTRs.
2) #PF must be intercepted whenever PTEs are 64 bits, to do a software
page walk of the guest's x86 page tables and check the bits above
guest.MAXPHYADDR for a terminal page fault that might result in a
different error code than the one produced by hardware.
3) EPT violations may require instruction emulation to synthesize an
appropriate #PF, and this requires the KVM instruction emulator to be
expanded to understand *all* memory-accessing instructions, not just
the limited set it understands now.

I just don't see how this is even remotely practical.
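(For illustration only, a hypothetical helper rather than code from the
thread: item 2 essentially means treating guest PTE physical-address bits at
or above guest.MAXPHYADDR as reserved during the software walk.)

/*
 * Sketch only: flag guest PTE physical-address bits at or above
 * guest.MAXPHYADDR (and below the architectural 52-bit limit) as reserved,
 * so a #PF with the RSVD error-code bit could be synthesized.  Assumes
 * guest_maxphyaddr < 52.
 */
static inline bool pte_has_rsvd_gpa_bits(u64 pte, u8 guest_maxphyaddr)
{
	u64 rsvd_mask = GENMASK_ULL(51, guest_maxphyaddr);

	return pte & rsvd_mask;
}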

> ---
>  arch/x86/kvm/cpuid.c   | 17 +++++++++++++----
>  arch/x86/kvm/mmu.h     |  2 ++
>  arch/x86/kvm/mmu/mmu.c |  5 +++++
>  3 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..5c346e1a10bd 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>                  *
>                  * If TDP is enabled but an explicit guest MAXPHYADDR is not
>                  * provided, use the raw bare metal MAXPHYADDR as reductions to
> -                * the HPAs do not affect GPAs.
> +                * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
> +                * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
> +                * bits as KVM can't install SPTEs for larger GPAs.
>                  */
> -               if (!tdp_enabled)
> +               if (!tdp_enabled) {
>                         g_phys_as = boot_cpu_data.x86_phys_bits;
> -               else if (!g_phys_as)
> -                       g_phys_as = phys_as;
> +               } else {
> +                       u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
> +
> +                       if (!g_phys_as)
> +                               g_phys_as = phys_as;
> +
> +                       if (max_tdp_level < 5)
> +                               g_phys_as = min(g_phys_as, 48);
> +               }
>
>                 entry->eax = g_phys_as | (virt_as << 8);
>                 entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>         return boot_cpu_data.x86_phys_bits;
>  }
>
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..b2845f5520b3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>         return max_tdp_level;
>  }
>
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> +       return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
>  static union kvm_mmu_page_role
>  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>                                 union kvm_cpu_role cpu_role)
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
Sean Christopherson Dec. 20, 2023, 3:37 p.m. UTC | #10
On Wed, Dec 20, 2023, Xiaoyao Li wrote:
> On 12/19/2023 11:26 PM, Sean Christopherson wrote:
> > KVM can simply
> > constrain the advertised MAXPHYADDR, no?
> 
> Sean. It looks you agree with this patch (Patch 1) now.

Doh, my apologies.  I don't think I even got to the code, I reacted to the shortlog,
changelog, and to patch 2.  I'll re-respond.
Sean Christopherson Dec. 20, 2023, 4:28 p.m. UTC | #11
As evidenced by my initial response, the shortlog is a bit misleading.  In non-KVM
code it would be perfectly ok to say "limit", but in KVM's split world where
*userspace* is mostly responsible for the guest configuration, "limit guest ..."
is often going to be read as "limit the capabilities of the guest".

This is also a good example of why shortlogs and changelogs should NOT be play-by-play
descriptions of the code change.  The literal code change applies a limit to
guest physical bits, but that doesn't capture the net effect of applying said limit.

Something like

  KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported

better captures that the patch affects what KVM advertises to userspace.  Yeah,
it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
ubiquitous enough that it's worth being technically wrong in order to clearly
communicate the net effect.  Alternatively, it could be something like:

  KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported

On Mon, Dec 18, 2023, Tao Su wrote:
> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> address must all be zero, otherwise an EPT violation always occurs and
> current handler can't resolve this if the gpa is in RAM region. Hence,
> instruction will keep being executed repeatedly, which causes infinite
> EPT violation.
> 
> Six KVM selftests are timeout due to this issue:
>     kvm:access_tracking_perf_test
>     kvm:demand_paging_test
>     kvm:dirty_log_test
>     kvm:dirty_log_perf_test
>     kvm:kvm_page_table_test
>     kvm:memslot_modification_stress_test
> 
> The above selftests add a RAM region close to max_gfn, if host has 52
> physical bits but doesn't support 5-level EPT, these will trigger infinite
> EPT violation when access the RAM region.
> 
> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> enabled and report the max guest physical bits which is smaller than host.
> 
> When guest physical bits is smaller than host, some GPA are illegal from
> guest's perspective, but are still legal from hardware's perspective,
> which should be trapped to inject #PF. Current KVM already has a parameter
> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> is enabled and guest accesses an illegal address from guest's perspective,
> KVM will utilize EPT violation and emulate the instruction to inject #PF
> and determine #PF error code.

There is far too much unnecessary cruft in this changelog.  The entire last
paragraph is extraneous information.  Talking in detail about KVM's (flawed)
emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
it's correct for KVM to advertise an impossible configuration.

And this is exactly why I dislike the "explain the symptom, then the solution"
style for KVM.  The symptoms described above don't actually explain why *KVM* is
at fault.

  KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported

  Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
  48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
  CPU during a page table walk if and only if 5-level TDP is enabled.  I.e.
  it's impossible for KVM to actually map GPAs with bits 51:49 set, which
  results in vCPUs getting stuck on endless EPT violations.

  From Intel's SDM:

    4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
    walk length of 4) to translate a guest-physical address and uses only
    bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
    access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
    and uses guest-physical address bits 56:0. [Physical addresses and
    guest-physical addresses are architecturally limited to 52 bits (e.g.,
    by paging), so in practice bits 56:52 are zero.]

  While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
  not 5-level EPT, the combination is architecturally legal and such CPUs
  do exist (and can easily be "created" with nested virtualization).

  Note, because EPT capabilities are reported via MSRs, it's impossible
  for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
  is 100% KVM's responsibility.

> Reported-by: Yi Lai <yi1.lai@intel.com>
> Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> Tested-by: Yi Lai <yi1.lai@intel.com>
> Tested-by: Xudong Hao <xudong.hao@intel.com>
> ---
>  arch/x86/kvm/cpuid.c   | 5 +++--
>  arch/x86/kvm/mmu.h     | 1 +
>  arch/x86/kvm/mmu/mmu.c | 7 +++++++
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index dda6fc4cfae8..91933ca739ad 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		 *
>  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
>  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> +		 * doesn't exceed the bits that TDP can translate.
>  		 */
>  		if (!tdp_enabled)
>  			g_phys_as = boot_cpu_data.x86_phys_bits;
>  		else if (!g_phys_as)
> -			g_phys_as = phys_as;
> +			g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());

I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
advertises guest.MAXPHYADDR.  Advertising a bad guest.MAXPHYADDR is arguably a
blatant CPU bug, but being paranoid is cheap in this case.

>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index bb8c86eefac0..1c7d649fcf6b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  					struct kvm_mmu *mmu);
> +unsigned int kvm_mmu_tdp_maxphyaddr(void);
>  
>  int kvm_mmu_load(struct kvm_vcpu *vcpu);
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c57e181bba21..72634d6b61b2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  	reset_guest_paging_metadata(vcpu, mmu);
>  }
>  
> +/* guest-physical-address bits limited by TDP */
> +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> +{
> +	return max_tdp_level == 5 ? 57 : 48;

Using "57" is kinda sorta wrong, e.g. the SDM says:

  Bits 56:52 of each guest-physical address are necessarily zero because
  guest-physical addresses are architecturally limited to 52 bits.

Rather than split hairs over something that doesn't matter, I think it makes sense
for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
is still accurate when tdp_root_level is non-zero).

That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
*host* MAXPHYADDR.

Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
used by the CPUID code, so it's not the end of the world.

> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);

This shouldn't be exported, I don't see any reason for vendor code to need access
to this helper.  It's essentially a moot point though if we avoid the helper in
the first place.

All in all, this?

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/cpuid.c            | 16 ++++++++++++----
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7bc1daf68741..29b575b86912 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 			     bool mask);
 
 extern bool tdp_enabled;
+extern int max_tdp_level;
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 294e5bd5f8a0..637a1f388a51 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		 *
 		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
 		 * provided, use the raw bare metal MAXPHYADDR as reductions to
-		 * the HPAs do not affect GPAs.
+		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
+		 * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
+		 * as bits 51:49 are used by the CPU if and only if 5-level TDP
+		 * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
 		 */
-		if (!tdp_enabled)
+		if (!tdp_enabled) {
 			g_phys_as = boot_cpu_data.x86_phys_bits;
-		else if (!g_phys_as)
-			g_phys_as = phys_as;
+		} else {
+			if (!g_phys_as)
+				g_phys_as = phys_as;
+
+			if (max_tdp_level < 5)
+				g_phys_as = min(g_phys_as, 48);
+		}
 
 		entry->eax = g_phys_as | (virt_as << 8);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..5036c7eb7dac 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
 
 static int max_huge_page_level __read_mostly;
 static int tdp_root_level __read_mostly;
-static int max_tdp_level __read_mostly;
+int max_tdp_level __read_mostly;
 
 #define PTE_PREFETCH_NUM		8
 

base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
--
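As a side note (not part of the series): a minimal user-space sketch of where
the capped value becomes visible to a VMM, by dumping leaf 0x80000008 from
KVM_GET_SUPPORTED_CPUID; error handling is trimmed.

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main(void)
{
	struct kvm_cpuid2 *cpuid;
	int kvm, i;

	kvm = open("/dev/kvm", O_RDWR);
	if (kvm < 0) {
		perror("open /dev/kvm");
		return 1;
	}

	/* Room for up to 128 CPUID entries; KVM updates nent on return. */
	cpuid = calloc(1, sizeof(*cpuid) + 128 * sizeof(struct kvm_cpuid_entry2));
	cpuid->nent = 128;
	if (ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid) < 0) {
		perror("KVM_GET_SUPPORTED_CPUID");
		return 1;
	}

	for (i = 0; i < cpuid->nent; i++) {
		/* Leaf 0x80000008, EAX bits 7:0 = advertised guest MAXPHYADDR. */
		if (cpuid->entries[i].function == 0x80000008)
			printf("advertised guest MAXPHYADDR: %u bits\n",
			       cpuid->entries[i].eax & 0xff);
	}
	return 0;
}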
Tao Su Dec. 21, 2023, 7:45 a.m. UTC | #12
On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> As evidenced by my initial response, the shortlog is a bit misleading.  In non-KVM
> code it would be perfectly ok to say "limit", but in KVM's split world where
> *userspace* is mostly responsible for the guest configuration, "limit guest ..."
> is often going to be read as "limit the capabilities of the guest".
> 
> This is also a good example of why shortlogs and changelogs should NOT be play-by-play
> descriptions of the code change.  The literal code change applies a limit to
> guest physical bits, but that doesn't capture the net effect of applying said limit.
> 
> Something like
> 
>   KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
> 
> better captures that the patch affects what KVM advertises to userspace.  Yeah,
> it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
> ubiquitous enough that it's worth being technically wrong in order to clearly
> communicate the net effect.  Alternatively, it could be something like:
> 
>   KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported

Yeah, these two shortlogs are more accurate and intuitive.

> 
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> > 
> > Six KVM selftests are timeout due to this issue:
> >     kvm:access_tracking_perf_test
> >     kvm:demand_paging_test
> >     kvm:dirty_log_test
> >     kvm:dirty_log_perf_test
> >     kvm:kvm_page_table_test
> >     kvm:memslot_modification_stress_test
> > 
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> > 
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> > 
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
> 
> There is far too much unnecessary cruft in this changelog.  The entire last
> paragraph is extraneous information.  Talking in detail about KVM's (flawed)
> emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
> it's correct for KVM to advertise an impossible configuration.
> 
> And this is exactly why I dislike the "explain the symptom, then the solution"
> style for KVM.  The symptoms described above don't actually explain why *KVM* is
> at fault.
> 
>   KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
> 
>   Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
>   48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
>   CPU during a page table walk if and only if 5-level TDP is enabled.  I.e.
>   it's impossible for KVM to actually map GPAs with bits 51:49 set, which
>   results in vCPUs getting stuck on endless EPT violations.
> 
>   From Intel's SDM:
> 
>     4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
>     walk length of 4) to translate a guest-physical address and uses only
>     bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
>     access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
>     and uses guest-physical address bits 56:0. [Physical addresses and
>     guest-physical addresses are architecturally limited to 52 bits (e.g.,
>     by paging), so in practice bits 56:52 are zero.]
> 
>   While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
>   not 5-level EPT, the combination is architecturally legal and such CPUs
>   do exist (and can easily be "created" with nested virtualization).
> 
>   Note, because EPT capabilities are reported via MSRs, it's impossible
>   for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
>   is 100% KVM's responsibility.

Yes, I learned a lot; I will use the above shortlog and changelog in my
next version.

> 
> > Reported-by: Yi Lai <yi1.lai@intel.com>
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > Tested-by: Xudong Hao <xudong.hao@intel.com>
> > ---
> >  arch/x86/kvm/cpuid.c   | 5 +++--
> >  arch/x86/kvm/mmu.h     | 1 +
> >  arch/x86/kvm/mmu/mmu.c | 7 +++++++
> >  3 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index dda6fc4cfae8..91933ca739ad 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  		 *
> >  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
> >  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> > -		 * the HPAs do not affect GPAs.
> > +		 * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> > +		 * doesn't exceed the bits that TDP can translate.
> >  		 */
> >  		if (!tdp_enabled)
> >  			g_phys_as = boot_cpu_data.x86_phys_bits;
> >  		else if (!g_phys_as)
> > -			g_phys_as = phys_as;
> > +			g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
> 
> I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
> advertises guest.MAXPHYADDR.  Advertising a bad guest.MAXPHYADDR is arguably a
> blatant CPU bug, but being paranoid is cheap in this case.

Yes, agree.

> 
> >  		entry->eax = g_phys_as | (virt_as << 8);
> >  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index bb8c86eefac0..1c7d649fcf6b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  				u64 fault_address, char *insn, int insn_len);
> >  void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >  					struct kvm_mmu *mmu);
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void);
> >  
> >  int kvm_mmu_load(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c57e181bba21..72634d6b61b2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >  	reset_guest_paging_metadata(vcpu, mmu);
> >  }
> >  
> > +/* guest-physical-address bits limited by TDP */
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > +{
> > +	return max_tdp_level == 5 ? 57 : 48;
> 
> Using "57" is kinda sorta wrong, e.g. the SDM says:
> 
>   Bits 56:52 of each guest-physical address are necessarily zero because
>   guest-physical addresses are architecturally limited to 52 bits.
> 
> Rather than split hairs over something that doesn't matter, I think it makes sense
> for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> is still accurate when tdp_root_level is non-zero).
> 
> That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
> *host* MAXPHYADDR.
> 
> Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
> used by the CPUID code, so it's not the end of the world.

Yes, agree.

> 
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
> 
> This shouldn't be exported, I don't see any reason for vendor code to need access
> to this helper.  It's essentially a moot point though if we avoid the helper in
> the first place.

Yes, previously I wanted to invoke this helper in patch 2; now it is no
longer needed.

> 
> All in all, this?

The code is perfect, I will add it to my next version, thanks!

> 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 16 ++++++++++++----
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  3 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7bc1daf68741..29b575b86912 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
>  			     bool mask);
>  
>  extern bool tdp_enabled;
> +extern int max_tdp_level;
>  
>  u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..637a1f388a51 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		 *
>  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
>  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
> +		 * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
> +		 * as bits 51:49 are used by the CPU if and only if 5-level TDP

Nit, it should be bits 51:48?

Thanks,
Tao

> +		 * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
>  		 */
> -		if (!tdp_enabled)
> +		if (!tdp_enabled) {
>  			g_phys_as = boot_cpu_data.x86_phys_bits;
> -		else if (!g_phys_as)
> -			g_phys_as = phys_as;
> +		} else {
> +			if (!g_phys_as)
> +				g_phys_as = phys_as;
> +
> +			if (max_tdp_level < 5)
> +				g_phys_as = min(g_phys_as, 48);
> +		}
>  
>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..5036c7eb7dac 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
>  
>  static int max_huge_page_level __read_mostly;
>  static int tdp_root_level __read_mostly;
> -static int max_tdp_level __read_mostly;
> +int max_tdp_level __read_mostly;
>  
>  #define PTE_PREFETCH_NUM		8
>  
> 
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> -- 
>
Xu Yilun Dec. 21, 2023, 8:19 a.m. UTC | #13
On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> As evidenced by my initial response, the shortlog is a bit misleading.  In non-KVM
> code it would be perfectly ok to say "limit", but in KVM's split world where
> *userspace* is mostly responsible for the guest configuration, "limit guest ..."
> is often going to be read as "limit the capabilities of the guest".
> 
> This is also a good example of why shortlogs and changelogs should NOT be play-by-play
> descriptions of the code change.  The literal code change applies a limit to
> guest physical bits, but that doesn't capture the net effect of applying said limit.
> 
> Something like
> 
>   KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
> 
> better captures that the patch affects what KVM advertises to userspace.  Yeah,
> it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
> ubiquitous enough that it's worth being technically wrong in order to clearly
> communicate the net effect.  Alternatively, it could be something like:
> 
>   KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported
> 
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> > 
> > Six KVM selftests are timeout due to this issue:
> >     kvm:access_tracking_perf_test
> >     kvm:demand_paging_test
> >     kvm:dirty_log_test
> >     kvm:dirty_log_perf_test
> >     kvm:kvm_page_table_test
> >     kvm:memslot_modification_stress_test
> > 
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> > 
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> > 
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
> 
> There is far too much unnecessary cruft in this changelog.  The entire last
> paragraph is extraneous information.  Talking in detail about KVM's (flawed)
> emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
> it's correct for KVM to advertise an impossible configuration.
> 
> And this is exactly why I dislike the "explain the symptom, then the solution"
> style for KVM.  The symptoms described above don't actually explain why *KVM* is
> at fault.
> 
>   KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
> 
>   Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
>   48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
>   CPU during a page table walk if and only if 5-level TDP is enabled.  I.e.
>   it's impossible for KVM to actually map GPAs with bits 51:49 set, which
>   results in vCPUs getting stuck on endless EPT violations.
> 
>   From Intel's SDM:
> 
>     4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
>     walk length of 4) to translate a guest-physical address and uses only
>     bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
>     access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
>     and uses guest-physical address bits 56:0. [Physical addresses and
>     guest-physical addresses are architecturally limited to 52 bits (e.g.,
>     by paging), so in practice bits 56:52 are zero.]
> 
>   While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
>   not 5-level EPT, the combination is architecturally legal and such CPUs
>   do exist (and can easily be "created" with nested virtualization).
> 
>   Note, because EPT capabilities are reported via MSRs, it's impossible
>   for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
>   is 100% KVM's responsibility.
> 
> > Reported-by: Yi Lai <yi1.lai@intel.com>
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > Tested-by: Xudong Hao <xudong.hao@intel.com>
> > ---
> >  arch/x86/kvm/cpuid.c   | 5 +++--
> >  arch/x86/kvm/mmu.h     | 1 +
> >  arch/x86/kvm/mmu/mmu.c | 7 +++++++
> >  3 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index dda6fc4cfae8..91933ca739ad 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  		 *
> >  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
> >  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> > -		 * the HPAs do not affect GPAs.
> > +		 * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> > +		 * doesn't exceed the bits that TDP can translate.
> >  		 */
> >  		if (!tdp_enabled)
> >  			g_phys_as = boot_cpu_data.x86_phys_bits;
> >  		else if (!g_phys_as)
> > -			g_phys_as = phys_as;
> > +			g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
> 
> I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
> advertises guest.MAXPHYADDR.  Advertising a bad guest.MAXPHYADDR is arguably a
> blatant CPU bug, but being paranoid is cheap in this case.
> 
> >  		entry->eax = g_phys_as | (virt_as << 8);
> >  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index bb8c86eefac0..1c7d649fcf6b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  				u64 fault_address, char *insn, int insn_len);
> >  void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >  					struct kvm_mmu *mmu);
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void);
> >  
> >  int kvm_mmu_load(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c57e181bba21..72634d6b61b2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >  	reset_guest_paging_metadata(vcpu, mmu);
> >  }
> >  
> > +/* guest-physical-address bits limited by TDP */
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > +{
> > +	return max_tdp_level == 5 ? 57 : 48;
> 
> Using "57" is kinda sorta wrong, e.g. the SDM says:
> 
>   Bits 56:52 of each guest-physical address are necessarily zero because
>   guest-physical addresses are architecturally limited to 52 bits.
> 
> Rather than split hairs over something that doesn't matter, I think it makes sense
> for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> is still accurate when tdp_root_level is non-zero).

It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
max_tdp_level:

	kvm_configure_mmu(npt_enabled, get_npt_level(),
			  get_npt_level(), PG_LEVEL_1G);

But I want to double-check whether directly using max_tdp_level has been fully
considered.  In your last proposal, it is:

  u8 kvm_mmu_get_max_tdp_level(void)
  {
	return tdp_root_level ? tdp_root_level : max_tdp_level;
  }

and I think it makes more sense, because EPT setup follows the same
rule.  If any future architecture sets tdp_root_level smaller than
max_tdp_level, the issue will happen again.

Thanks,
Yilun

> 
> That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
> *host* MAXPHYADDR.
> 
> Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
> used by the CPUID code, so it's not the end of the world.
> 
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
> 
> This shouldn't be exported, I don't see any reason for vendor code to need access
> to this helper.  It's essentially a moot point though if we avoid the helper in
> the first place.
> 
> All in all, this?
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 16 ++++++++++++----
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  3 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7bc1daf68741..29b575b86912 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
>  			     bool mask);
>  
>  extern bool tdp_enabled;
> +extern int max_tdp_level;
>  
>  u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..637a1f388a51 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		 *
>  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
>  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
> +		 * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
> +		 * as bits 51:48 are used by the CPU if and only if 5-level TDP
> +		 * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
>  		 */
> -		if (!tdp_enabled)
> +		if (!tdp_enabled) {
>  			g_phys_as = boot_cpu_data.x86_phys_bits;
> -		else if (!g_phys_as)
> -			g_phys_as = phys_as;
> +		} else {
> +			if (!g_phys_as)
> +				g_phys_as = phys_as;
> +
> +			if (max_tdp_level < 5)
> +				g_phys_as = min(g_phys_as, 48);
> +		}
>  
>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..5036c7eb7dac 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
>  
>  static int max_huge_page_level __read_mostly;
>  static int tdp_root_level __read_mostly;
> -static int max_tdp_level __read_mostly;
> +int max_tdp_level __read_mostly;
>  
>  #define PTE_PREFETCH_NUM		8
>  
> 
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> -- 
> 
>
Sean Christopherson Jan. 2, 2024, 11:24 p.m. UTC | #14
On Thu, Dec 21, 2023, Xu Yilun wrote:
> On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index c57e181bba21..72634d6b61b2 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > >  	reset_guest_paging_metadata(vcpu, mmu);
> > >  }
> > >  
> > > +/* guest-physical-address bits limited by TDP */
> > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > +{
> > > +	return max_tdp_level == 5 ? 57 : 48;
> > 
> > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > 
> >   Bits 56:52 of each guest-physical address are necessarily zero because
> >   guest-physical addresses are architecturally limited to 52 bits.
> > 
> > Rather than split hairs over something that doesn't matter, I think it makes sense
> > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > is still accurate when tdp_root_level is non-zero).
> 
> It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> max_tdp_level:
> 
> 	kvm_configure_mmu(npt_enabled, get_npt_level(),
> 			  get_npt_level(), PG_LEVEL_1G);
> 
> But I wanna doulbe confirm if directly using max_tdp_level is fully
> considered.  In your last proposal, it is:
> 
>   u8 kvm_mmu_get_max_tdp_level(void)
>   {
> 	return tdp_root_level ? tdp_root_level : max_tdp_level;
>   }
> 
> and I think it makes more sense, because EPT setup follows the same
> rule.  If any future architechture sets tdp_root_level smaller than
> max_tdp_level, the issue will happen again.

Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
really means "max possible TDP level KVM can use".  If an exact TDP level is being
forced by tdp_root_level, then by definition it's also the max TDP level, because
it's the _only_ TDP level KVM supports.
Jim Mattson Jan. 3, 2024, 12:34 a.m. UTC | #15
On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Dec 21, 2023, Xu Yilun wrote:
> > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index c57e181bba21..72634d6b61b2 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > >   reset_guest_paging_metadata(vcpu, mmu);
> > > >  }
> > > >
> > > > +/* guest-physical-address bits limited by TDP */
> > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > > +{
> > > > + return max_tdp_level == 5 ? 57 : 48;
> > >
> > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > >
> > >   Bits 56:52 of each guest-physical address are necessarily zero because
> > >   guest-physical addresses are architecturally limited to 52 bits.
> > >
> > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > is still accurate when tdp_root_level is non-zero).
> >
> > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > max_tdp_level:
> >
> >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> >                         get_npt_level(), PG_LEVEL_1G);
> >
> > But I wanna doulbe confirm if directly using max_tdp_level is fully
> > considered.  In your last proposal, it is:
> >
> >   u8 kvm_mmu_get_max_tdp_level(void)
> >   {
> >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> >   }
> >
> > and I think it makes more sense, because EPT setup follows the same
> > rule.  If any future architechture sets tdp_root_level smaller than
> > max_tdp_level, the issue will happen again.
>
> Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> really means "max possible TDP level KVM can use".  If an exact TDP level is being
> forced by tdp_root_level, then by definition it's also the max TDP level, because
> it's the _only_ TDP level KVM supports.

This is all just so broken and wrong. The only guest.MAXPHYADDR that
can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
every #PF, and to emulate the faulting instruction to see if the RSVD
bit should be set in the error code. Hardware isn't going to do it.
Since some page faults may occur in CPL3, this means that KVM has to
be prepared to emulate any memory-accessing instruction. That's not
practical.

Basically, a CPU with more than 48 bits of physical address that
doesn't support 5-level EPT really doesn't support EPT at all, except
perhaps in the context of some new paravirtual pinky-swear from the
guest that it doesn't care about the RSVD bit in #PF error codes.
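
To make the gap concrete: with guest.MAXPHYADDR = 48 on a 52-bit host, bits
51:48 of a translated GPA are reserved from the guest's point of view but are
perfectly legal to the hardware page walker, so nothing sets RSVD for free.
Below is a minimal, self-contained sketch of the check that would have to be
synthesized in software; the names are illustrative, not KVM internals.

#include <stdbool.h>
#include <stdint.h>

/* Bits [51:guest_maxphyaddr] are reserved from the guest's point of view. */
static inline uint64_t guest_rsvd_gpa_mask(unsigned int guest_maxphyaddr)
{
	return ((1ULL << 52) - 1) & ~((1ULL << guest_maxphyaddr) - 1);
}

/*
 * Whether a synthesized #PF error code needs the RSVD bit, e.g. true for
 * gpa = 1ULL << 50 with guest_maxphyaddr = 48, even though a 52-bit host
 * walks and maps that address without complaint.
 */
static inline bool gpa_is_guest_reserved(uint64_t gpa,
					 unsigned int guest_maxphyaddr)
{
	return (gpa & guest_rsvd_gpa_mask(guest_maxphyaddr)) != 0;
}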
Sean Christopherson Jan. 3, 2024, 6:04 p.m. UTC | #16
On Tue, Jan 02, 2024, Jim Mattson wrote:
> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index c57e181bba21..72634d6b61b2 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > > >   reset_guest_paging_metadata(vcpu, mmu);
> > > > >  }
> > > > >
> > > > > +/* guest-physical-address bits limited by TDP */
> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > > > +{
> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > > >
> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > > >
> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
> > > >   guest-physical addresses are architecturally limited to 52 bits.
> > > >
> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > > is still accurate when tdp_root_level is non-zero).
> > >
> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > > max_tdp_level:
> > >
> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> > >                         get_npt_level(), PG_LEVEL_1G);
> > >
> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
> > > considered.  In your last proposal, it is:
> > >
> > >   u8 kvm_mmu_get_max_tdp_level(void)
> > >   {
> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> > >   }
> > >
> > > and I think it makes more sense, because EPT setup follows the same
> > > rule.  If any future architechture sets tdp_root_level smaller than
> > > max_tdp_level, the issue will happen again.
> >
> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > it's the _only_ TDP level KVM supports.
> 
> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> every #PF, and to emulate the faulting instruction to see if the RSVD
> bit should be set in the error code. Hardware isn't going to do it.
> Since some page faults may occur in CPL3, this means that KVM has to
> be prepared to emulate any memory-accessing instruction. That's not
> practical.
> 
> Basically, a CPU with more than 48 bits of physical address that
> doesn't support 5-level EPT really doesn't support EPT at all, except
> perhaps in the context of some new paravirtual pinky-swear from the
> guest that it doesn't care about the RSVD bit in #PF error codes.

Doh, I managed to forget about the RSVD #PF mess.  That said, this patch will
"work" if userspace enables allow_smaller_maxphyaddr.  In quotes because I'm still
skeptical that allow_smaller_maxphyaddr actually works in all scenarios.  And we'd
need a way to communicate all of that to userspace.  Blech.
Chao Gao Jan. 4, 2024, 2:45 a.m. UTC | #17
On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
>On Tue, Jan 02, 2024, Jim Mattson wrote:
>> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
>> >
>> > On Thu, Dec 21, 2023, Xu Yilun wrote:
>> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
>> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> > > > > index c57e181bba21..72634d6b61b2 100644
>> > > > > --- a/arch/x86/kvm/mmu/mmu.c
>> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
>> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>> > > > >   reset_guest_paging_metadata(vcpu, mmu);
>> > > > >  }
>> > > > >
>> > > > > +/* guest-physical-address bits limited by TDP */
>> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
>> > > > > +{
>> > > > > + return max_tdp_level == 5 ? 57 : 48;
>> > > >
>> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
>> > > >
>> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
>> > > >   guest-physical addresses are architecturally limited to 52 bits.
>> > > >
>> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
>> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
>> > > > is still accurate when tdp_root_level is non-zero).
>> > >
>> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
>> > > max_tdp_level:
>> > >
>> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
>> > >                         get_npt_level(), PG_LEVEL_1G);
>> > >
>> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
>> > > considered.  In your last proposal, it is:
>> > >
>> > >   u8 kvm_mmu_get_max_tdp_level(void)
>> > >   {
>> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
>> > >   }
>> > >
>> > > and I think it makes more sense, because EPT setup follows the same
>> > > rule.  If any future architechture sets tdp_root_level smaller than
>> > > max_tdp_level, the issue will happen again.
>> >
>> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
>> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
>> > forced by tdp_root_level, then by definition it's also the max TDP level, because
>> > it's the _only_ TDP level KVM supports.
>> 
>> This is all just so broken and wrong. The only guest.MAXPHYADDR that
>> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
>> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
>> every #PF,

in this case (i.e., to support 48-bit guest.MAXPHYADDR when the CPU supports only
4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
RSVD bits 51:48 set leads to an EPT violation.

>> and to emulate the faulting instruction to see if the RSVD
>> bit should be set in the error code. Hardware isn't going to do it.

Note that for EPT violation VM exits, the CPU stores the GPA that caused the exit
in the "guest-physical address" field of the VMCS, so it is not necessary to emulate
the faulting instruction to determine whether any RSVD bit is set.

>> Since some page faults may occur in CPL3, this means that KVM has to
>> be prepared to emulate any memory-accessing instruction. That's not
>> practical.

as said above, no need to intercept #PF for this specific case.
Jim Mattson Jan. 4, 2024, 3:40 a.m. UTC | #18
On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >> >
> >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> > > > > index c57e181bba21..72634d6b61b2 100644
> >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >> > > > >   reset_guest_paging_metadata(vcpu, mmu);
> >> > > > >  }
> >> > > > >
> >> > > > > +/* guest-physical-address bits limited by TDP */
> >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> >> > > > > +{
> >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> >> > > >
> >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> >> > > >
> >> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
> >> > > >   guest-physical addresses are architecturally limited to 52 bits.
> >> > > >
> >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> >> > > > is still accurate when tdp_root_level is non-zero).
> >> > >
> >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> >> > > max_tdp_level:
> >> > >
> >> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> >> > >                         get_npt_level(), PG_LEVEL_1G);
> >> > >
> >> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
> >> > > considered.  In your last proposal, it is:
> >> > >
> >> > >   u8 kvm_mmu_get_max_tdp_level(void)
> >> > >   {
> >> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> >> > >   }
> >> > >
> >> > > and I think it makes more sense, because EPT setup follows the same
> >> > > rule.  If any future architechture sets tdp_root_level smaller than
> >> > > max_tdp_level, the issue will happen again.
> >> >
> >> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> >> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
> >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> >> > it's the _only_ TDP level KVM supports.
> >>
> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> >> every #PF,
>
> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> RSVD bits 51-48 set leads to EPT violation.

At the completion of the page table walk, if there is a permission
fault, the data address should not be accessed, so there should not be
an EPT violation. Remember Meltdown?

> >> and to emulate the faulting instruction to see if the RSVD
> >> bit should be set in the error code. Hardware isn't going to do it.
>
> Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> the faulting instruction to determine if any RSVD bit is set.

There should not be an EPT violation in the case discussed.

> >> Since some page faults may occur in CPL3, this means that KVM has to
> >> be prepared to emulate any memory-accessing instruction. That's not
> >> practical.
>
> as said above, no need to intercept #PF for this specific case.

I disagree. See above.
Jim Mattson Jan. 4, 2024, 4:34 a.m. UTC | #19
On Wed, Jan 3, 2024 at 7:40 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> >
> > On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > >> >
> > >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > >> > > > > index c57e181bba21..72634d6b61b2 100644
> > >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > >> > > > >   reset_guest_paging_metadata(vcpu, mmu);
> > >> > > > >  }
> > >> > > > >
> > >> > > > > +/* guest-physical-address bits limited by TDP */
> > >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > >> > > > > +{
> > >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > >> > > >
> > >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > >> > > >
> > >> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
> > >> > > >   guest-physical addresses are architecturally limited to 52 bits.
> > >> > > >
> > >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > >> > > > is still accurate when tdp_root_level is non-zero).
> > >> > >
> > >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > >> > > max_tdp_level:
> > >> > >
> > >> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> > >> > >                         get_npt_level(), PG_LEVEL_1G);
> > >> > >
> > >> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
> > >> > > considered.  In your last proposal, it is:
> > >> > >
> > >> > >   u8 kvm_mmu_get_max_tdp_level(void)
> > >> > >   {
> > >> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> > >> > >   }
> > >> > >
> > >> > > and I think it makes more sense, because EPT setup follows the same
> > >> > > rule.  If any future architechture sets tdp_root_level smaller than
> > >> > > max_tdp_level, the issue will happen again.
> > >> >
> > >> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> > >> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
> > >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > >> > it's the _only_ TDP level KVM supports.
> > >>
> > >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > >> every #PF,
> >
> > in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > RSVD bits 51-48 set leads to EPT violation.
>
> At the completion of the page table walk, if there is a permission
> fault, the data address should not be accessed, so there should not be
> an EPT violation. Remember Meltdown?
>
> > >> and to emulate the faulting instruction to see if the RSVD
> > >> bit should be set in the error code. Hardware isn't going to do it.
> >
> > Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> > in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> > the faulting instruction to determine if any RSVD bit is set.
>
> There should not be an EPT violation in the case discussed.

For intercepted #PF, we can use CR2 to determine the necessary page
walk, and presumably the rest of the bits in the error code are
already set, so emulation is not necessary.

However, emulation is necessary when synthesizing a #PF from an EPT
violation whose exit qualification has bit 8 clear. See
https://lore.kernel.org/kvm/4463f391-0a25-017e-f913-69c297e13c5e@redhat.com/.
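
A self-contained sketch of that distinction is below; the constant mirrors the
SDM's exit-qualification layout for EPT violations, and the names are
illustrative rather than verified KVM defines.

#include <stdbool.h>
#include <stdint.h>

/*
 * SDM, exit qualification for EPT violations: bit 8 is set when the faulting
 * GPA is the final translation of a linear address, and clear when it is a
 * paging-structure access made during the walk.
 */
#define EPT_VIOLATION_GVA_TRANSLATED	(1ULL << 8)

static bool must_emulate_to_synthesize_pf(uint64_t exit_qual)
{
	/*
	 * Final translation: the #PF error code (RSVD, W/R, U/S) can be built
	 * from the exit information.  Paging-structure access: the faulting
	 * instruction has to be emulated to reconstruct the error code.
	 */
	return !(exit_qual & EPT_VIOLATION_GVA_TRANSLATED);
}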

> > >> Since some page faults may occur in CPL3, this means that KVM has to
> > >> be prepared to emulate any memory-accessing instruction. That's not
> > >> practical.
> >
> > as said above, no need to intercept #PF for this specific case.
>
> I disagree. See above.
Tao Su Jan. 4, 2024, 11:56 a.m. UTC | #20
On Wed, Jan 03, 2024 at 08:34:16PM -0800, Jim Mattson wrote:
> On Wed, Jan 3, 2024 at 7:40 PM Jim Mattson <jmattson@google.com> wrote:
> >
> > On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > >
> > > On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >> >
> > > >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > > >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > >> > > > > index c57e181bba21..72634d6b61b2 100644
> > > >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > >> > > > >   reset_guest_paging_metadata(vcpu, mmu);
> > > >> > > > >  }
> > > >> > > > >
> > > >> > > > > +/* guest-physical-address bits limited by TDP */
> > > >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > >> > > > > +{
> > > >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > > >> > > >
> > > >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > > >> > > >
> > > >> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
> > > >> > > >   guest-physical addresses are architecturally limited to 52 bits.
> > > >> > > >
> > > >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > >> > > > is still accurate when tdp_root_level is non-zero).
> > > >> > >
> > > >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > > >> > > max_tdp_level:
> > > >> > >
> > > >> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> > > >> > >                         get_npt_level(), PG_LEVEL_1G);
> > > >> > >
> > > >> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
> > > >> > > considered.  In your last proposal, it is:
> > > >> > >
> > > >> > >   u8 kvm_mmu_get_max_tdp_level(void)
> > > >> > >   {
> > > >> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> > > >> > >   }
> > > >> > >
> > > >> > > and I think it makes more sense, because EPT setup follows the same
> > > >> > > rule.  If any future architechture sets tdp_root_level smaller than
> > > >> > > max_tdp_level, the issue will happen again.
> > > >> >
> > > >> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> > > >> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
> > > >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > > >> > it's the _only_ TDP level KVM supports.
> > > >>
> > > >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > >> every #PF,
> > >
> > > in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > RSVD bits 51-48 set leads to EPT violation.
> >
> > At the completion of the page table walk, if there is a permission
> > fault, the data address should not be accessed, so there should not be
> > an EPT violation. Remember Meltdown?
> >
> > > >> and to emulate the faulting instruction to see if the RSVD
> > > >> bit should be set in the error code. Hardware isn't going to do it.
> > >
> > > Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> > > in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> > > the faulting instruction to determine if any RSVD bit is set.
> >
> > There should not be an EPT violation in the case discussed.
> 
> For intercepted #PF, we can use CR2 to determine the necessary page
> walk, and presumably the rest of the bits in the error code are
> already set, so emulation is not necessary.
> 
> However, emulation is necessary when synthesizing a #PF from an EPT
> violation, and bit 8 of the exit qualification is clear. See
> https://lore.kernel.org/kvm/4463f391-0a25-017e-f913-69c297e13c5e@redhat.com/.

Although not all memory-accessing instructions can be emulated, emulation covers the
most common cases and is always better than KVM hanging. We can probably continue to
improve allow_smaller_maxphyaddr, but KVM should report the maximum physical width
it actually supports.

Thanks,
Tao
Jim Mattson Jan. 4, 2024, 2:03 p.m. UTC | #21
On Thu, Jan 4, 2024 at 3:59 AM Tao Su <tao1.su@linux.intel.com> wrote:
>
> On Wed, Jan 03, 2024 at 08:34:16PM -0800, Jim Mattson wrote:
> > On Wed, Jan 3, 2024 at 7:40 PM Jim Mattson <jmattson@google.com> wrote:
> > >
> > > On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > > >
> > > > On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > > >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > > >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >> >
> > > > >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > > > >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > > >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > >> > > > > index c57e181bba21..72634d6b61b2 100644
> > > > >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > > >> > > > >   reset_guest_paging_metadata(vcpu, mmu);
> > > > >> > > > >  }
> > > > >> > > > >
> > > > >> > > > > +/* guest-physical-address bits limited by TDP */
> > > > >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > > >> > > > > +{
> > > > >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > > > >> > > >
> > > > >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > > > >> > > >
> > > > >> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
> > > > >> > > >   guest-physical addresses are architecturally limited to 52 bits.
> > > > >> > > >
> > > > >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > > >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > > >> > > > is still accurate when tdp_root_level is non-zero).
> > > > >> > >
> > > > >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > > > >> > > max_tdp_level:
> > > > >> > >
> > > > >> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> > > > >> > >                         get_npt_level(), PG_LEVEL_1G);
> > > > >> > >
> > > > >> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
> > > > >> > > considered.  In your last proposal, it is:
> > > > >> > >
> > > > >> > >   u8 kvm_mmu_get_max_tdp_level(void)
> > > > >> > >   {
> > > > >> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> > > > >> > >   }
> > > > >> > >
> > > > >> > > and I think it makes more sense, because EPT setup follows the same
> > > > >> > > rule.  If any future architechture sets tdp_root_level smaller than
> > > > >> > > max_tdp_level, the issue will happen again.
> > > > >> >
> > > > >> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> > > > >> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
> > > > >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > > > >> > it's the _only_ TDP level KVM supports.
> > > > >>
> > > > >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > > >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > > >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > > >> every #PF,
> > > >
> > > > in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > > 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > > RSVD bits 51-48 set leads to EPT violation.
> > >
> > > At the completion of the page table walk, if there is a permission
> > > fault, the data address should not be accessed, so there should not be
> > > an EPT violation. Remember Meltdown?
> > >
> > > > >> and to emulate the faulting instruction to see if the RSVD
> > > > >> bit should be set in the error code. Hardware isn't going to do it.
> > > >
> > > > Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> > > > in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> > > > the faulting instruction to determine if any RSVD bit is set.
> > >
> > > There should not be an EPT violation in the case discussed.
> >
> > For intercepted #PF, we can use CR2 to determine the necessary page
> > walk, and presumably the rest of the bits in the error code are
> > already set, so emulation is not necessary.
> >
> > However, emulation is necessary when synthesizing a #PF from an EPT
> > violation, and bit 8 of the exit qualification is clear. See
> > https://lore.kernel.org/kvm/4463f391-0a25-017e-f913-69c297e13c5e@redhat.com/.
>
> Although not all memory-accessing instructions are emulated, it covers most common
> cases and is always better than KVM hangs anyway. We may probably continue to
> improve allow_smaller_maxphyaddr, but KVM should report the maximum physical width
> it supports.

KVM can only support the host MAXPHYADDR. If EPT on the CPU doesn't
support host MAXPHYADDR, it should be disabled. Shadow paging can
handle host MAXPHYADDR just fine.

KVM simply does not work when guest MAXPHYADDR < host MAXPHYADDR.
Without additional hardware support, no hypervisor can. I asked Intel
to add hardware support for such configurations about 15 years ago. I
have yet to see it.

> Thanks,
> Tao
>
Chao Gao Jan. 4, 2024, 3:07 p.m. UTC | #22
On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
>On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
>>
>> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
>> >On Tue, Jan 02, 2024, Jim Mattson wrote:
>> >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
>> >> >
>> >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
>> >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
>> >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> >> > > > > index c57e181bba21..72634d6b61b2 100644
>> >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
>> >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
>> >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>> >> > > > >   reset_guest_paging_metadata(vcpu, mmu);
>> >> > > > >  }
>> >> > > > >
>> >> > > > > +/* guest-physical-address bits limited by TDP */
>> >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
>> >> > > > > +{
>> >> > > > > + return max_tdp_level == 5 ? 57 : 48;
>> >> > > >
>> >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
>> >> > > >
>> >> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
>> >> > > >   guest-physical addresses are architecturally limited to 52 bits.
>> >> > > >
>> >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
>> >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
>> >> > > > is still accurate when tdp_root_level is non-zero).
>> >> > >
>> >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
>> >> > > max_tdp_level:
>> >> > >
>> >> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
>> >> > >                         get_npt_level(), PG_LEVEL_1G);
>> >> > >
>> >> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
>> >> > > considered.  In your last proposal, it is:
>> >> > >
>> >> > >   u8 kvm_mmu_get_max_tdp_level(void)
>> >> > >   {
>> >> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
>> >> > >   }
>> >> > >
>> >> > > and I think it makes more sense, because EPT setup follows the same
>> >> > > rule.  If any future architechture sets tdp_root_level smaller than
>> >> > > max_tdp_level, the issue will happen again.
>> >> >
>> >> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
>> >> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
>> >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
>> >> > it's the _only_ TDP level KVM supports.
>> >>
>> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
>> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
>> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
>> >> every #PF,
>>
>> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
>> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
>> RSVD bits 51-48 set leads to EPT violation.
>
>At the completion of the page table walk, if there is a permission
>fault, the data address should not be accessed, so there should not be
>an EPT violation. Remember Meltdown?

You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
in PFEC.
Jim Mattson Jan. 4, 2024, 5:02 p.m. UTC | #23
On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
>
> On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> >>
> >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> >> >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >> >> >
> >> >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> >> >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> >> >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> >> > > > > index c57e181bba21..72634d6b61b2 100644
> >> >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> >> >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> >> >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >> >> > > > >   reset_guest_paging_metadata(vcpu, mmu);
> >> >> > > > >  }
> >> >> > > > >
> >> >> > > > > +/* guest-physical-address bits limited by TDP */
> >> >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> >> >> > > > > +{
> >> >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> >> >> > > >
> >> >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> >> >> > > >
> >> >> > > >   Bits 56:52 of each guest-physical address are necessarily zero because
> >> >> > > >   guest-physical addresses are architecturally limited to 52 bits.
> >> >> > > >
> >> >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> >> >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> >> >> > > > is still accurate when tdp_root_level is non-zero).
> >> >> > >
> >> >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> >> >> > > max_tdp_level:
> >> >> > >
> >> >> > >       kvm_configure_mmu(npt_enabled, get_npt_level(),
> >> >> > >                         get_npt_level(), PG_LEVEL_1G);
> >> >> > >
> >> >> > > But I wanna doulbe confirm if directly using max_tdp_level is fully
> >> >> > > considered.  In your last proposal, it is:
> >> >> > >
> >> >> > >   u8 kvm_mmu_get_max_tdp_level(void)
> >> >> > >   {
> >> >> > >       return tdp_root_level ? tdp_root_level : max_tdp_level;
> >> >> > >   }
> >> >> > >
> >> >> > > and I think it makes more sense, because EPT setup follows the same
> >> >> > > rule.  If any future architechture sets tdp_root_level smaller than
> >> >> > > max_tdp_level, the issue will happen again.
> >> >> >
> >> >> > Setting tdp_root_level != max_tdp_level would be a blatant bug.  max_tdp_level
> >> >> > really means "max possible TDP level KVM can use".  If an exact TDP level is being
> >> >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> >> >> > it's the _only_ TDP level KVM supports.
> >> >>
> >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> >> >> every #PF,
> >>
> >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> >> RSVD bits 51-48 set leads to EPT violation.
> >
> >At the completion of the page table walk, if there is a permission
> >fault, the data address should not be accessed, so there should not be
> >an EPT violation. Remember Meltdown?
>
> You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> in PFEC.

I have no problem with a user deliberately choosing an unsupported
configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
returning an unsupported configuration.

guest MAXPHYADDR < host MAXPHYADDR has the following issues:

1. In PAE mode, MOV to CR3 will not raise #GP for guest-reserved bits
in PDPTRs that are not host-reserved.
2. #PF for permission violations will not set the RSVD bit in the
error code for guest-reserved bits in the final data address PFN that
are not host-reserved.
3. #PF for other PFNs with guest-reserved bits that are not
host-reserved may not accurately set the non-RSVD bits (e.g. U/S, R/W)
in the error code.

Fix these three issues, and I will happily withdraw my objection.
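
To make issue 1 concrete, here is a minimal, self-contained sketch of the PDPTE
address-reserved check: hardware performs it against host.MAXPHYADDR, while the
guest architecturally expects it against its own MAXPHYADDR.  The low reserved
bits (2:1 and 8:5) are omitted and the names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define PDPTE_PRESENT	(1ULL << 0)

/*
 * Address bits [63:maxphyaddr] of a present PDPTE must be zero, otherwise
 * MOV to CR3 raises #GP.  Hardware uses host.MAXPHYADDR for this check.
 */
static bool pdpte_gp_on_load(uint64_t pdpte, unsigned int maxphyaddr)
{
	uint64_t rsvd = ~((1ULL << maxphyaddr) - 1);

	return (pdpte & PDPTE_PRESENT) && (pdpte & rsvd);
}

/*
 * Example: a present PDPTE with bit 50 set.
 *   pdpte_gp_on_load(pdpte, 48) -> true   (what the guest architecturally expects)
 *   pdpte_gp_on_load(pdpte, 52) -> false  (what the hardware actually checks)
 */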
Sean Christopherson Jan. 5, 2024, 8:26 p.m. UTC | #24
On Thu, Jan 04, 2024, Jim Mattson wrote:
> On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
> >
> > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > >>
> > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > >> >> every #PF,
> > >>
> > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > >> RSVD bits 51-48 set leads to EPT violation.
> > >
> > >At the completion of the page table walk, if there is a permission
> > >fault, the data address should not be accessed, so there should not be
> > >an EPT violation. Remember Meltdown?
> >
> > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> > in PFEC.
> 
> I have no problem with a user deliberately choosing an unsupported
> configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
> returning an unsupported configuration.

+1

Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply
isn't viable when TDP is enabled.  I suppose KVM could do so when allow_smaller_maxphyaddr
is enabled, but that's just asking for confusion, e.g. if userspace reflects the
CPUID back into the guest, it could unknowingly create a VM that depends on
allow_smaller_maxphyaddr.

I think the least awful option is to have the kernel expose whether or not the
CPU support 5-level EPT to userspace.  That doesn't even require new uAPI per se,
just a new flag in /proc/cpuinfo.  It'll be a bit gross for userspace to parse,
but it's not the end of the world.  Alternatively, KVM could add a capability to
enumerate the max *addressable* GPA, but userspace would still need to manually
take action when KVM can't address all of memory, i.e. a capability would be less
ugly, but wouldn't meaningfully change userspace's responsibilities.

I.e.

diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index c6a7eed03914..266daf5b5b84 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -25,6 +25,7 @@
 #define VMX_FEATURE_EPT_EXECUTE_ONLY   ( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */
 #define VMX_FEATURE_EPT_AD             ( 0*32+ 18) /* EPT Accessed/Dirty bits */
 #define VMX_FEATURE_EPT_1GB            ( 0*32+ 19) /* 1GB EPT pages */
+#define VMX_FEATURE_EPT_5LEVEL         ( 0*32+ 20) /* 5-level EPT paging */
 
 /* Aggregated APIC features 24-27 */
 #define VMX_FEATURE_FLEXPRIORITY       ( 0*32+ 24) /* TPR shadow + virt APIC */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index 03851240c3e3..1640ae76548f 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
                c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
        if (ept & VMX_EPT_1GB_PAGE_BIT)
                c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
+       if (ept & VMX_EPT_PAGE_WALK_5_BIT)
+               c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
 
        /* Synthetic APIC features that are aggregates of multiple features. */
        if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
Tao Su Jan. 8, 2024, 1:45 p.m. UTC | #25
On Fri, Jan 05, 2024 at 12:26:08PM -0800, Sean Christopherson wrote:
> On Thu, Jan 04, 2024, Jim Mattson wrote:
> > On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
> > >
> > > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> > > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > > >>
> > > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > >> >> every #PF,
> > > >>
> > > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > >> RSVD bits 51-48 set leads to EPT violation.
> > > >
> > > >At the completion of the page table walk, if there is a permission
> > > >fault, the data address should not be accessed, so there should not be
> > > >an EPT violation. Remember Meltdown?
> > >
> > > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> > > in PFEC.
> > 
> > I have no problem with a user deliberately choosing an unsupported
> > configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
> > returning an unsupported configuration.
> 
> +1
> 
> Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply
> isn't viable when TDP is enabled.  I suppose KVM do so when allow_smaller_maxphyaddr
> is enabled, but that's just asking for confusion, e.g. if userspace reflects the
> CPUID back into the guest, it could unknowingly create a VM that depends on
> allow_smaller_maxphyaddr.
> 
> I think the least awful option is to have the kernel expose whether or not the
> CPU support 5-level EPT to userspace.  That doesn't even require new uAPI per se,
> just a new flag in /proc/cpuinfo.  It'll be a bit gross for userspace to parse,
> but it's not the end of the world.  Alternatively, KVM could add a capability to
> enumerate the max *addressable* GPA, but userspace would still need to manually
> take action when KVM can't address all of memory, i.e. a capability would be less
> ugly, but wouldn't meaningfully change userspace's responsibilities.

Yes, exposing whether the CPU supports 5-level EPT is indeed a better solution; it
not only sidesteps the KVM_GET_SUPPORTED_CPUID problem but also provides the
information to userspace.

I think a new KVM capability to enumerate the max GPA is better, since it is more
convenient for userspace to consume, e.g., to automatically limit the advertised
physical bits or the GPAs used for user memory regions.

But reporting this capability alone can't solve the KVM hang issue, since userspace can
choose to ignore the max GPA, e.g., the six selftests in the changelog still fail. I
think we can reconsider patch 2 if we don't advertise
guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID.

Thanks,
Tao

> 
> I.e.
> 
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index c6a7eed03914..266daf5b5b84 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -25,6 +25,7 @@
>  #define VMX_FEATURE_EPT_EXECUTE_ONLY   ( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */
>  #define VMX_FEATURE_EPT_AD             ( 0*32+ 18) /* EPT Accessed/Dirty bits */
>  #define VMX_FEATURE_EPT_1GB            ( 0*32+ 19) /* 1GB EPT pages */
> +#define VMX_FEATURE_EPT_5LEVEL         ( 0*32+ 20) /* 5-level EPT paging */
>  
>  /* Aggregated APIC features 24-27 */
>  #define VMX_FEATURE_FLEXPRIORITY       ( 0*32+ 24) /* TPR shadow + virt APIC */
> diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
> index 03851240c3e3..1640ae76548f 100644
> --- a/arch/x86/kernel/cpu/feat_ctl.c
> +++ b/arch/x86/kernel/cpu/feat_ctl.c
> @@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>                 c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
>         if (ept & VMX_EPT_1GB_PAGE_BIT)
>                 c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
> +       if (ept & VMX_EPT_PAGE_WALK_5_BIT)
> +               c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
>  
>         /* Synthetic APIC features that are aggregates of multiple features. */
>         if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
>
Sean Christopherson Jan. 8, 2024, 3:29 p.m. UTC | #26
On Mon, Jan 08, 2024, Tao Su wrote:
> On Fri, Jan 05, 2024 at 12:26:08PM -0800, Sean Christopherson wrote:
> > On Thu, Jan 04, 2024, Jim Mattson wrote:
> > > On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
> > > >
> > > > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> > > > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > > > >>
> > > > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > > >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > > >> >> every #PF,
> > > > >>
> > > > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > > >> RSVD bits 51-48 set leads to EPT violation.
> > > > >
> > > > >At the completion of the page table walk, if there is a permission
> > > > >fault, the data address should not be accessed, so there should not be
> > > > >an EPT violation. Remember Meltdown?
> > > >
> > > > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> > > > in PFEC.
> > > 
> > > I have no problem with a user deliberately choosing an unsupported
> > > configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
> > > returning an unsupported configuration.
> > 
> > +1
> > 
> > Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply
> > isn't viable when TDP is enabled.  I suppose KVM do so when allow_smaller_maxphyaddr
> > is enabled, but that's just asking for confusion, e.g. if userspace reflects the
> > CPUID back into the guest, it could unknowingly create a VM that depends on
> > allow_smaller_maxphyaddr.
> > 
> > I think the least awful option is to have the kernel expose whether or not the
> > CPU support 5-level EPT to userspace.  That doesn't even require new uAPI per se,
> > just a new flag in /proc/cpuinfo.  It'll be a bit gross for userspace to parse,
> > but it's not the end of the world.  Alternatively, KVM could add a capability to
> > enumerate the max *addressable* GPA, but userspace would still need to manually
> > take action when KVM can't address all of memory, i.e. a capability would be less
> > ugly, but wouldn't meaningfully change userspace's responsibilities.
> 
> Yes, exposing whether the CPU support 5-level EPT is indeed a better solution, it
> not only bypasses the KVM_GET_SUPPORTED_CPUID but also provides the information to
> userspace.
> 
> I think a new KVM capability to enumerate the max GPA is better since it will be
> more convenient if userspace wants to use, e.g., automatically limit physical bits
> or the GPA in the user memory region.

Not really, because "automatically" limiting guest.MAXPHYADDR without setting
allow_smaller_maxphyaddr isn't exactly safe.  I think it's also useful to capture
*why* KVM's max addressable GPA is smaller than host.MAXPHYADDR, e.g. if down the
road someone ships a CPU that actually does the right thing when guest.MAXPHYADDR
is smaller than host.MAXPHYADDR.

> But only reporting this capability can’t solve the KVM hang issue, userspace can
> choose to ignore the max GPA, e.g., six selftests in changelog are still failed.

I know.  I just have more pressing concerns than making selftests work on funky
hardware that AFAIK isn't publicly available.

> I think we can reconsider patch2 if we don’t advertise
> guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID.

Nah, someone just needs to update the selftests if/when the ept_5level flag
lands.
Patch

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dda6fc4cfae8..91933ca739ad 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1212,12 +1212,13 @@  static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		 *
 		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
 		 * provided, use the raw bare metal MAXPHYADDR as reductions to
-		 * the HPAs do not affect GPAs.
+		 * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
+		 * doesn't exceed the bits that TDP can translate.
 		 */
 		if (!tdp_enabled)
 			g_phys_as = boot_cpu_data.x86_phys_bits;
 		else if (!g_phys_as)
-			g_phys_as = phys_as;
+			g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
 
 		entry->eax = g_phys_as | (virt_as << 8);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bb8c86eefac0..1c7d649fcf6b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -115,6 +115,7 @@  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len);
 void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
 					struct kvm_mmu *mmu);
+unsigned int kvm_mmu_tdp_maxphyaddr(void);
 
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
 void kvm_mmu_unload(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c57e181bba21..72634d6b61b2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5177,6 +5177,13 @@  void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
 	reset_guest_paging_metadata(vcpu, mmu);
 }
 
+/* guest-physical-address bits limited by TDP */
+unsigned int kvm_mmu_tdp_maxphyaddr(void)
+{
+	return max_tdp_level == 5 ? 57 : 48;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
+
 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 {
 	/* tdp_root_level is architecture forced level, use it if nonzero */