
[v7,2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}

Message ID 20230404130923.27749-3-binbin.wu@linux.intel.com (mailing list archive)
State New, archived
Series Linear Address Masking (LAM) KVM Enabling

Commit Message

Binbin Wu April 4, 2023, 1:09 p.m. UTC
From: Robert Hoo <robert.hu@linux.intel.com>

LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
masking for user mode pointers.

When EPT is on:
CR3 is fully under control of the guest; guest LAM is thus transparent to KVM.

When EPT is off (shadow paging):
- KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
  The two bits are allowed to be set in CR3 if vCPU supports LAM.
  The two bits should be kept as they are in the shadow CR3.
- Perform GFN calculation from guest CR3/PGD generically by extracting the
  maximal base address mask since the validity has been checked already.
- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
  It could potentially increase root cache misses and mmu reload, however,
  it's very rare to turn off EPT when performance matters.

To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
the bits used to control supported features related to CR3 (e.g. LAM).
- Supported control bits are set to cr3_ctrl_bits after setting CPUID.
- Add kvm_vcpu_is_legal_cr3() to validate CR3, to allow setting of control
  bits for the supported features.
- Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
  a new guest CR3 (in vmx_load_mmu_pgd()).
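
For illustration, a rough sketch of how the helpers compose (an editor's
example, not part of the patch; the MAXPHYADDR value and the CR3 value
below are made-up assumptions):

	/*
	 * Assume guest MAXPHYADDR = 46, so reserved_gpa_bits covers bits 63:46,
	 * and the vCPU supports LAM, so cr3_ctrl_bits = LAM_U48 | LAM_U57.
	 */
	unsigned long cr3 = BIT_ULL(62) | BIT_ULL(61) | 0x123000;

	kvm_vcpu_is_legal_gpa(vcpu, cr3);	/* false: bits 62:61 hit reserved GPA bits */
	kvm_vcpu_is_legal_cr3(vcpu, cr3);	/* true: LAM bits are carved out of the check */
	kvm_get_active_cr3_ctrl_bits(vcpu);	/* LAM_U48 | LAM_U57, once this CR3 is loaded */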

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 6 ++++++
 arch/x86/kvm/cpuid.h            | 5 +++++
 arch/x86/kvm/mmu.h              | 5 +++++
 arch/x86/kvm/mmu/mmu.c          | 6 +++++-
 arch/x86/kvm/mmu/mmu_internal.h | 1 +
 arch/x86/kvm/mmu/paging_tmpl.h  | 6 +++++-
 arch/x86/kvm/mmu/spte.h         | 2 +-
 arch/x86/kvm/vmx/nested.c       | 4 ++--
 arch/x86/kvm/vmx/vmx.c          | 6 +++++-
 arch/x86/kvm/x86.c              | 4 ++--
 10 files changed, 37 insertions(+), 8 deletions(-)

Comments

Huang, Kai April 6, 2023, 12:57 p.m. UTC | #1
On Tue, 2023-04-04 at 21:09 +0800, Binbin Wu wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
> masking for user mode pointers.
> 
> When EPT is on:
> CR3 is fully under control of guest, guest LAM is thus transparent to KVM.
> 
> When EPT is off (shadow paging):
> - KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
>   The two bits are allowed to be set in CR3 if vCPU supports LAM.
>   The two bits should be kept as they are in the shadow CR3.
> - Perform GFN calculation from guest CR3/PGD generically by extracting the
>   maximal base address mask since the validity has been checked already.
> - Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>   It could potentially increase root cache misses and mmu reload, however,
>   it's very rare to turn off EPT when performace matters.
> 
> To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
> the bits used to control supported features related to CR3 (e.g. LAM).
> - Supported control bits are set to cr3_ctrl_bits after set cpuid.
> - Add kvm_vcpu_is_legal_cr3() to validate CR3, tp allow setting of control
						 ^
						 to ?
Could you run spell check for all patches?

>   bits for the supported features.
> - Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
>   a new guest CR3 (in vmx_load_mmu_pgd()).

KVM handles #PF for shadow MMU, and for TDP (EPT) there's also a case that KVM
will trap the #PF (see allow_smaller_maxphyaddr).  Do we need any handling to
the linear address in the #PF, i.e. stripping metadata off while walking page
table?  

I guess it's already done automatically?  Anyway, IMO it would be better if you
can add one or two sentences in the changelog to clarify whether such handling
is needed, and if not, why.

> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 6 ++++++
>  arch/x86/kvm/cpuid.h            | 5 +++++
>  arch/x86/kvm/mmu.h              | 5 +++++
>  arch/x86/kvm/mmu/mmu.c          | 6 +++++-
>  arch/x86/kvm/mmu/mmu_internal.h | 1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  | 6 +++++-
>  arch/x86/kvm/mmu/spte.h         | 2 +-
>  arch/x86/kvm/vmx/nested.c       | 4 ++--
>  arch/x86/kvm/vmx/vmx.c          | 6 +++++-
>  arch/x86/kvm/x86.c              | 4 ++--
>  10 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ba594f9ea414..498d2b5e8dc1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>  	unsigned long cr0_guest_owned_bits;
>  	unsigned long cr2;
>  	unsigned long cr3;
> +	/*
> +	 * Bits in CR3 used to enable certain features. These bits are allowed
> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
> +	 * is used, these bits should be kept as they are in the shadow CR3.
> +	 */
> +	u64 cr3_ctrl_bits;
>  	unsigned long cr4;
>  	unsigned long cr4_guest_owned_bits;
>  	unsigned long cr4_guest_rsvd_bits;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index b1658c0de847..ef8e1b912d7d 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.maxphyaddr;
>  }
>  
> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> +{
> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
> +}
> +
>  static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>  	return !(gpa & vcpu->arch.reserved_gpa_bits);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 168c46fd8dd1..29985eeb8e12 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>  	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>  }
>  
> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
> +}
> +
>  static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>  {
>  	u64 root_hpa = vcpu->arch.mmu->root.hpa;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8ebe542c565..de2c51a0b611 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	hpa_t root;
>  
>  	root_pgd = mmu->get_guest_pgd(vcpu);
> -	root_gfn = root_pgd >> PAGE_SHIFT;
> +	/*
> +	* The guest PGD has already been checked for validity, unconditionally
> +	* strip non-address bits when computing the GFN.
> +	*/
> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>  
>  	if (mmu_check_root(vcpu, root_gfn))
>  		return 1;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..c0479cbc2ca3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -21,6 +21,7 @@ extern bool dbg;
>  #endif
>  
>  /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>  #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>  	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>  #define __PT_INDEX(address, level, bits_per_level) \
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..88351ba04249 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -62,7 +62,7 @@
>  #endif
>  
>  /* Common logic, but per-type values.  These also need to be undefined. */
> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>  #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	trace_kvm_mmu_pagetable_walk(addr, access);
>  retry_walk:
>  	walker->level = mmu->cpu_role.base.level;
> +	/*
> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
> +	 * non-address bits.
> +	 */
>  	pte           = mmu->get_guest_pgd(vcpu);
>  	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>  
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1279db2eab44..777f7d443e3b 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>  #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
>  #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
>  #else
> -#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> +#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
>  #endif
>  
>  #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 1bc2b80273c9..d35bda9610e2 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1079,7 +1079,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>  			       bool nested_ept, bool reload_pdptrs,
>  			       enum vm_entry_failure_code *entry_failure_code)
>  {
> -	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
> +	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
>  		*entry_failure_code = ENTRY_FAIL_DEFAULT;
>  		return -EINVAL;
>  	}
> @@ -2907,7 +2907,7 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
>  
>  	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
>  	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
> -	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
> +	    CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3)))
>  		return -EINVAL;
>  
>  	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 42f163862a0f..4d329ee9474c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3388,7 +3388,8 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  			update_guest_cr3 = false;
>  		vmx_ept_load_pdptrs(vcpu);
>  	} else {
> -		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
> +		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
> +		            kvm_get_active_cr3_ctrl_bits(vcpu);
>  	}
>  
>  	if (update_guest_cr3)
> @@ -7763,6 +7764,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  		vmx->msr_ia32_feature_control_valid_bits &=
>  			~FEAT_CTL_SGX_LC_ENABLED;
>  
> +	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
> +		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
> +
>  	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>  	vmx_update_exception_bitmap(vcpu);
>  }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7713420abab0..aca255e69d0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>  	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>  	 * the current vCPU mode is accurate.
>  	 */
> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
> +	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
>  		return 1;
>  
>  	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
> @@ -11349,7 +11349,7 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  		 */
>  		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
>  			return false;
> -		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
> +		if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
>  			return false;
>  	} else {
>  		/*
Binbin Wu April 9, 2023, 11:36 a.m. UTC | #2
On 4/6/2023 8:57 PM, Huang, Kai wrote:
> On Tue, 2023-04-04 at 21:09 +0800, Binbin Wu wrote:
>> From: Robert Hoo <robert.hu@linux.intel.com>
>>
>> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
>> masking for user mode pointers.
>>
>> When EPT is on:
>> CR3 is fully under control of guest, guest LAM is thus transparent to KVM.
>>
>> When EPT is off (shadow paging):
>> - KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
>>    The two bits are allowed to be set in CR3 if vCPU supports LAM.
>>    The two bits should be kept as they are in the shadow CR3.
>> - Perform GFN calculation from guest CR3/PGD generically by extracting the
>>    maximal base address mask since the validity has been checked already.
>> - Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>>    It could potentially increase root cache misses and mmu reload, however,
>>    it's very rare to turn off EPT when performace matters.
>>
>> To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
>> the bits used to control supported features related to CR3 (e.g. LAM).
>> - Supported control bits are set to cr3_ctrl_bits after set cpuid.
>> - Add kvm_vcpu_is_legal_cr3() to validate CR3, tp allow setting of control
> 						 ^
> 						 to ?
> Could you run spell check for all patches?


OK, thanks for your advice.

>
>>    bits for the supported features.
>> - Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
>>    a new guest CR3 (in vmx_load_mmu_pgd()).
> KVM handles #PF for shadow MMU, and for TDP (EPT) there's also a case that KVM
> will trap the #PF (see allow_smaller_maxphyaddr).  Do we need any handling to
> the linear address in the #PF, i.e. stripping metadata off while walking page
> table?
>
> I guess it's already done automatically?  Anyway, IMO it would be better if you
> can add one or two sentences in the changelog to clarify whether such handling
> is needed, and if not, why.

LAM masking applies before paging, so the faulting linear address
doesn't contain the metadata.
It has been mentioned in the cover letter, but to be clear, I will add
the clarification in the changelog of patch 4.
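
As an aside, a rough sketch of the masking itself (an editor's
illustration based on the architectural LAM_U57 definition, not code
from this series): the metadata bits 62:57 of a user pointer are
replaced with the value of bit 56 before translation, so nothing is
left to strip by the time the page walk happens.

	static inline u64 lam_u57_untag(u64 va)
	{
		u64 meta = GENMASK_ULL(62, 57);

		/* Sign-extend bit 56 into bits 62:57; bit 63 is untouched. */
		return (va & BIT_ULL(56)) ? (va | meta) : (va & ~meta);
	}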

Thanks.


>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h | 6 ++++++
>>   arch/x86/kvm/cpuid.h            | 5 +++++
>>   arch/x86/kvm/mmu.h              | 5 +++++
>>   arch/x86/kvm/mmu/mmu.c          | 6 +++++-
>>   arch/x86/kvm/mmu/mmu_internal.h | 1 +
>>   arch/x86/kvm/mmu/paging_tmpl.h  | 6 +++++-
>>   arch/x86/kvm/mmu/spte.h         | 2 +-
>>   arch/x86/kvm/vmx/nested.c       | 4 ++--
>>   arch/x86/kvm/vmx/vmx.c          | 6 +++++-
>>   arch/x86/kvm/x86.c              | 4 ++--
>>   10 files changed, 37 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index ba594f9ea414..498d2b5e8dc1 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>>   	unsigned long cr0_guest_owned_bits;
>>   	unsigned long cr2;
>>   	unsigned long cr3;
>> +	/*
>> +	 * Bits in CR3 used to enable certain features. These bits are allowed
>> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
>> +	 * is used, these bits should be kept as they are in the shadow CR3.
>> +	 */
>> +	u64 cr3_ctrl_bits;
>>   	unsigned long cr4;
>>   	unsigned long cr4_guest_owned_bits;
>>   	unsigned long cr4_guest_rsvd_bits;
>> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
>> index b1658c0de847..ef8e1b912d7d 100644
>> --- a/arch/x86/kvm/cpuid.h
>> +++ b/arch/x86/kvm/cpuid.h
>> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>>   	return vcpu->arch.maxphyaddr;
>>   }
>>   
>> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>> +{
>> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
>> +}
>> +
>>   static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>>   {
>>   	return !(gpa & vcpu->arch.reserved_gpa_bits);
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 168c46fd8dd1..29985eeb8e12 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>>   	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>>   }
>>   
>> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
>> +}
>> +
>>   static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>>   {
>>   	u64 root_hpa = vcpu->arch.mmu->root.hpa;
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index c8ebe542c565..de2c51a0b611 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>>   	hpa_t root;
>>   
>>   	root_pgd = mmu->get_guest_pgd(vcpu);
>> -	root_gfn = root_pgd >> PAGE_SHIFT;
>> +	/*
>> +	* The guest PGD has already been checked for validity, unconditionally
>> +	* strip non-address bits when computing the GFN.
>> +	*/
>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>   
>>   	if (mmu_check_root(vcpu, root_gfn))
>>   		return 1;
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index cc58631e2336..c0479cbc2ca3 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -21,6 +21,7 @@ extern bool dbg;
>>   #endif
>>   
>>   /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>>   #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>>   	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>>   #define __PT_INDEX(address, level, bits_per_level) \
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 57f0b75c80f9..88351ba04249 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -62,7 +62,7 @@
>>   #endif
>>   
>>   /* Common logic, but per-type values.  These also need to be undefined. */
>> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
>> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>>   #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
>> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>   	trace_kvm_mmu_pagetable_walk(addr, access);
>>   retry_walk:
>>   	walker->level = mmu->cpu_role.base.level;
>> +	/*
>> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
>> +	 * non-address bits.
>> +	 */
>>   	pte           = mmu->get_guest_pgd(vcpu);
>>   	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>>   
>> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
>> index 1279db2eab44..777f7d443e3b 100644
>> --- a/arch/x86/kvm/mmu/spte.h
>> +++ b/arch/x86/kvm/mmu/spte.h
>> @@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>>   #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
>>   #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
>>   #else
>> -#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>> +#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
>>   #endif
>>   
>>   #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 1bc2b80273c9..d35bda9610e2 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1079,7 +1079,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>>   			       bool nested_ept, bool reload_pdptrs,
>>   			       enum vm_entry_failure_code *entry_failure_code)
>>   {
>> -	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
>> +	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
>>   		*entry_failure_code = ENTRY_FAIL_DEFAULT;
>>   		return -EINVAL;
>>   	}
>> @@ -2907,7 +2907,7 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
>>   
>>   	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
>>   	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
>> -	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
>> +	    CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3)))
>>   		return -EINVAL;
>>   
>>   	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 42f163862a0f..4d329ee9474c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -3388,7 +3388,8 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>>   			update_guest_cr3 = false;
>>   		vmx_ept_load_pdptrs(vcpu);
>>   	} else {
>> -		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
>> +		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
>> +		            kvm_get_active_cr3_ctrl_bits(vcpu);
>>   	}
>>   
>>   	if (update_guest_cr3)
>> @@ -7763,6 +7764,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   		vmx->msr_ia32_feature_control_valid_bits &=
>>   			~FEAT_CTL_SGX_LC_ENABLED;
>>   
>> +	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
>> +		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
>> +
>>   	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>>   	vmx_update_exception_bitmap(vcpu);
>>   }
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 7713420abab0..aca255e69d0d 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>>   	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>>   	 * the current vCPU mode is accurate.
>>   	 */
>> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>> +	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
>>   		return 1;
>>   
>>   	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
>> @@ -11349,7 +11349,7 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>>   		 */
>>   		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
>>   			return false;
>> -		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
>> +		if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
>>   			return false;
>>   	} else {
>>   		/*
Huang, Kai April 11, 2023, 11:11 p.m. UTC | #3
> > 
> > >    bits for the supported features.
> > > - Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
> > >    a new guest CR3 (in vmx_load_mmu_pgd()).
> > KVM handles #PF for shadow MMU, and for TDP (EPT) there's also a case that KVM
> > will trap the #PF (see allow_smaller_maxphyaddr).  Do we need any handling to
> > the linear address in the #PF, i.e. stripping metadata off while walking page
> > table?
> > 
> > I guess it's already done automatically?  Anyway, IMO it would be better if you
> > can add one or two sentences in the changelog to clarify whether such handling
> > is needed, and if not, why.
> 
> LAM masking applies before paging, so the faulting linear address 
> doesn't contain the metadata.
> It has been mentioned in cover letter, but to be clear, I will add the 
> clarification in the changelog
> of patch 4.

Yes I think this will be helpful.  Thanks.
Huang, Kai April 12, 2023, 11:58 a.m. UTC | #4
> 
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>  	unsigned long cr0_guest_owned_bits;
>  	unsigned long cr2;
>  	unsigned long cr3;
> +	/*
> +	 * Bits in CR3 used to enable certain features. These bits are allowed
> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
> +	 * is used, these bits should be kept as they are in the shadow CR3.
> +	 */

I don't quite follow the second sentence.  Not sure what "these bits should
be kept" means.

Those control bits are not active bits in guest's CR3 but all control bits that
the guest is allowed to set in CR3. And those bits depend on the guest's CPUID,
not on whether the guest is using shadow paging or not.

I think you can just remove the second sentence.

> +	u64 cr3_ctrl_bits;
>  	unsigned long cr4;
>  	unsigned long cr4_guest_owned_bits;
>  	unsigned long cr4_guest_rsvd_bits;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index b1658c0de847..ef8e1b912d7d 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.maxphyaddr;
>  }
>  
> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> +{
> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
> +}
> +
>  static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>  	return !(gpa & vcpu->arch.reserved_gpa_bits);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 168c46fd8dd1..29985eeb8e12 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>  	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>  }
>  
> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
> +}
> +
>  static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>  {
>  	u64 root_hpa = vcpu->arch.mmu->root.hpa;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8ebe542c565..de2c51a0b611 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	hpa_t root;
>  
>  	root_pgd = mmu->get_guest_pgd(vcpu);
> -	root_gfn = root_pgd >> PAGE_SHIFT;
> +	/*
> +	* The guest PGD has already been checked for validity, unconditionally
> +	* strip non-address bits when computing the GFN.
> +	*/

Don't quite follow this comment either.  Can we just say:

	/*
	 * Guest's PGD may contain additional control bits.  Mask them off
	 * to get the GFN.
	 */

Which explains why it has "non-address bits" that need to be masked off?

> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;

Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
mmu_check_root() may potentially catch other invalid bits, but in practice there
should be no difference I guess. 

>  
>  	if (mmu_check_root(vcpu, root_gfn))
>  		return 1;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..c0479cbc2ca3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -21,6 +21,7 @@ extern bool dbg;
>  #endif
>  
>  /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>  #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>  	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>  #define __PT_INDEX(address, level, bits_per_level) \
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..88351ba04249 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -62,7 +62,7 @@
>  #endif
>  
>  /* Common logic, but per-type values.  These also need to be undefined. */
> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>  #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	trace_kvm_mmu_pagetable_walk(addr, access);
>  retry_walk:
>  	walker->level = mmu->cpu_role.base.level;
> +	/*
> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
> +	 * non-address bits.
> +	 */

I guess it will be helpful if we actually call out that guest's PGD may contain
control bits here.

Also, I am not sure whether it's better to just explicitly mask control bits off
here.

>  	pte           = mmu->get_guest_pgd(vcpu);
>  	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>  
> 

[...]
Binbin Wu April 13, 2023, 1:36 a.m. UTC | #5
On 4/12/2023 7:58 PM, Huang, Kai wrote:
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>>   	unsigned long cr0_guest_owned_bits;
>>   	unsigned long cr2;
>>   	unsigned long cr3;
>> +	/*
>> +	 * Bits in CR3 used to enable certain features. These bits are allowed
>> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
>> +	 * is used, these bits should be kept as they are in the shadow CR3.
>> +	 */
> I don't quite follow the second sentence.  Not sure what does "these bits should
> be kept" mean.
>
> Those control bits are not active bits in guest's CR3 but all control bits that
> guest is allowed to set to CR3. And those bits depends on guest's CPUID but not
> whether guest is using shadow paging or not.
>
> I think you can just remove the second sentence.

Yes, you are right. The second sentence is confusing.

How about this:

+	/*
+	 * Bits in CR3 used to enable certain features. These bits are allowed
+	 * to be set in CR3 when vCPU supports the features, and they are used
+	 * as the mask to get the active control bits to form a new guest CR3.
+	 */


>
>> +	u64 cr3_ctrl_bits;
>>   	unsigned long cr4;
>>   	unsigned long cr4_guest_owned_bits;
>>   	unsigned long cr4_guest_rsvd_bits;
>> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
>> index b1658c0de847..ef8e1b912d7d 100644
>> --- a/arch/x86/kvm/cpuid.h
>> +++ b/arch/x86/kvm/cpuid.h
>> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>>   	return vcpu->arch.maxphyaddr;
>>   }
>>   
>> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>> +{
>> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
>> +}
>> +
>>   static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>>   {
>>   	return !(gpa & vcpu->arch.reserved_gpa_bits);
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 168c46fd8dd1..29985eeb8e12 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>>   	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>>   }
>>   
>> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
>> +}
>> +
>>   static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>>   {
>>   	u64 root_hpa = vcpu->arch.mmu->root.hpa;
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index c8ebe542c565..de2c51a0b611 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>>   	hpa_t root;
>>   
>>   	root_pgd = mmu->get_guest_pgd(vcpu);
>> -	root_gfn = root_pgd >> PAGE_SHIFT;
>> +	/*
>> +	* The guest PGD has already been checked for validity, unconditionally
>> +	* strip non-address bits when computing the GFN.
>> +	*/

The comment here is to call out that the set non-address bit(s) have
already been checked for legality, so it is safe to strip these bits.


> Don't quite follow this comment either.  Can we just say:
>
> 	/*
> 	 * Guest's PGD may contain additional control bits.  Mask them off
> 	 * to get the GFN.
> 	 */
>
> Which explains why it has "non-address bits" and needs mask off?

How about merging this comment?


>
>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
> mmu_check_root() may potentially catch other invalid bits, but in practice there
> should be no difference I guess.

In the previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.

However, Sean pointed out that the return value of
mmu->get_guest_pgd(vcpu) could be EPTP for the nested case, so it is not
rational to mask CR3 bit(s) off an EPTP.

Since the guest PGD has been checked for validity, for both CR3 and
EPTP, it is safe to mask off non-address bits to get the GFN.

Maybe I should add this CR3 vs. EPTP part to the changelog to make it
more understandable.



>
>>   
>>   	if (mmu_check_root(vcpu, root_gfn))
>>   		return 1;
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index cc58631e2336..c0479cbc2ca3 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -21,6 +21,7 @@ extern bool dbg;
>>   #endif
>>   
>>   /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>>   #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>>   	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>>   #define __PT_INDEX(address, level, bits_per_level) \
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 57f0b75c80f9..88351ba04249 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -62,7 +62,7 @@
>>   #endif
>>   
>>   /* Common logic, but per-type values.  These also need to be undefined. */
>> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
>> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>>   #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
>> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>   	trace_kvm_mmu_pagetable_walk(addr, access);
>>   retry_walk:
>>   	walker->level = mmu->cpu_role.base.level;
>> +	/*
>> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
>> +	 * non-address bits.
>> +	 */
> I guess it will be helpful if we actually call out that guest's PGD may contain
> control bits here.

OK.


>
> Also, I am not sure whether it's better to just explicitly mask control bits off
> here.

Same reason as mentioned above.


>
>>   	pte           = mmu->get_guest_pgd(vcpu);
>>   	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>>   
>>
> [...]
Huang, Kai April 13, 2023, 2:27 a.m. UTC | #6
On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
> On 4/12/2023 7:58 PM, Huang, Kai wrote:
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
> > >   	unsigned long cr0_guest_owned_bits;
> > >   	unsigned long cr2;
> > >   	unsigned long cr3;
> > > +	/*
> > > +	 * Bits in CR3 used to enable certain features. These bits are allowed
> > > +	 * to be set in CR3 when vCPU supports the features. When shadow paging
> > > +	 * is used, these bits should be kept as they are in the shadow CR3.
> > > +	 */
> > I don't quite follow the second sentence.  Not sure what does "these bits should
> > be kept" mean.
> > 
> > Those control bits are not active bits in guest's CR3 but all control bits that
> > guest is allowed to set to CR3. And those bits depends on guest's CPUID but not
> > whether guest is using shadow paging or not.
> > 
> > I think you can just remove the second sentence.
> 
> Yes, you are right. The second sentenc is confusing.
> 
> How about this:
> 
> +	/*
> +	 * Bits in CR3 used to enable certain features. These bits are allowed
> +	 * to be set in CR3 when vCPU supports the features, and they are used
> +	 * as the mask to get the active control bits to form a new guest CR3.
> +	 */
> 

Fine to me, but IMHO it can be even simpler, for instance:

	/*
	 * CR3 non-address feature control bits.  Guest's CR3 may contain
	 * any of those bits at runtime.
	 */

> 
> > 
> > > +	u64 cr3_ctrl_bits;
> > >   	unsigned long cr4;
> > >   	unsigned long cr4_guest_owned_bits;
> > >   	unsigned long cr4_guest_rsvd_bits;
> > > diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> > > index b1658c0de847..ef8e1b912d7d 100644
> > > --- a/arch/x86/kvm/cpuid.h
> > > +++ b/arch/x86/kvm/cpuid.h
> > > @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
> > >   	return vcpu->arch.maxphyaddr;
> > >   }
> > >   
> > > +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> > > +{
> > > +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
> > > +}
> > > +
> > >   static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
> > >   {
> > >   	return !(gpa & vcpu->arch.reserved_gpa_bits);
> > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > > index 168c46fd8dd1..29985eeb8e12 100644
> > > --- a/arch/x86/kvm/mmu.h
> > > +++ b/arch/x86/kvm/mmu.h
> > > @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
> > >   	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
> > >   }
> > >   
> > > +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
> > > +{
> > > +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
> > > +}
> > > +
> > >   static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
> > >   {
> > >   	u64 root_hpa = vcpu->arch.mmu->root.hpa;
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index c8ebe542c565..de2c51a0b611 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> > >   	hpa_t root;
> > >   
> > >   	root_pgd = mmu->get_guest_pgd(vcpu);
> > > -	root_gfn = root_pgd >> PAGE_SHIFT;
> > > +	/*
> > > +	* The guest PGD has already been checked for validity, unconditionally
> > > +	* strip non-address bits when computing the GFN.
> > > +	*/
> 
> The comment here is to call out that the set non-address bit(s) have 
> been checked for legality
> before, it is safe to strip these bits.
> 
> 
> > Don't quite follow this comment either.  Can we just say:
> > 
> > 	/*
> > 	 * Guest's PGD may contain additional control bits.  Mask them off
> > 	 * to get the GFN.
> > 	 */
> > 
> > Which explains why it has "non-address bits" and needs mask off?
> 
> How about merge this comment?

No strong opinion.  

> 
> 
> > 
> > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
> > mmu_check_root() may potentially catch other invalid bits, but in practice there
> > should be no difference I guess.
> 
> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
> 
> However, Sean pointed out that the return value of 
> mmu->get_guest_pgd(vcpu) could be
> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.

Yes, although EPTP's high bits don't contain any control bits.

But perhaps we want to make it future-proof in case some more control bits are
added to EPTP too.

> 
> Since the guest pgd has been check for valadity, for both CR3 and EPTP, 
> it is safe to mask off
> non-address bits to get GFN.
> 
> Maybe I should add this CR3 VS. EPTP part to the changelog to make it 
> more undertandable.

This isn't necessary, and can/should be done in comments if needed.

But IMHO you may want to add more material to explain how nested cases are
handled.
Binbin Wu April 13, 2023, 4:45 a.m. UTC | #7
On 4/13/2023 10:27 AM, Huang, Kai wrote:
> On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>> On 4/12/2023 7:58 PM, Huang, Kai wrote:
>>
...
>>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
>>> mmu_check_root() may potentially catch other invalid bits, but in practice there
>>> should be no difference I guess.
>> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>>
>> However, Sean pointed out that the return value of
>> mmu->get_guest_pgd(vcpu) could be
>> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
> Yes, although EPTP's high bits don't contain any control bits.
>
> But perhaps we want to make it future-proof in case some more control bits are
> added to EPTP too.
>
>> Since the guest pgd has been check for valadity, for both CR3 and EPTP,
>> it is safe to mask off
>> non-address bits to get GFN.
>>
>> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>> more undertandable.
> This isn't necessary, and can/should be done in comments if needed.
>
> But IMHO you may want to add more material to explain how nested cases are
> handled.

Do you mean about CR3 or others?
Huang, Kai April 13, 2023, 9:13 a.m. UTC | #8
> On 4/13/2023 10:27 AM, Huang, Kai wrote:
> > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
> >> On 4/12/2023 7:58 PM, Huang, Kai wrote:
> >>
> ...
> >>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> >>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
> >>> way, below
> >>> mmu_check_root() may potentially catch other invalid bits, but in
> >>> practice there should be no difference I guess.
> >> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
> >>
> >> However, Sean pointed out that the return value of
> >> mmu->get_guest_pgd(vcpu) could be
> >> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
> > Yes, although EPTP's high bits don't contain any control bits.
> >
> > But perhaps we want to make it future-proof in case some more control
> > bits are added to EPTP too.
> >
> >> Since the guest pgd has been check for valadity, for both CR3 and
> >> EPTP, it is safe to mask off non-address bits to get GFN.
> >>
> >> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
> >> more undertandable.
> > This isn't necessary, and can/should be done in comments if needed.
> >
> > But IMHO you may want to add more material to explain how nested cases
> > are handled.
> 
> Do you mean about CR3 or others?
> 

This patch is about CR3, so CR3.
Chao Gao April 17, 2023, 7:24 a.m. UTC | #9
On Tue, Apr 04, 2023 at 09:09:20PM +0800, Binbin Wu wrote:
> /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>+#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))

This is an open-coded GENMASK_ULL(). So, you'd better use GENMASK_ULL() here.

>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
> 	 * the current vCPU mode is accurate.
> 	 */
>-	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>+	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))

I prefer to modify the call sites in SVM nested code to use the new
function. Although this change does not affect functionality, it
provides a clear distinction between CR3 checks and GPA checks.
Binbin Wu April 17, 2023, 8:02 a.m. UTC | #10
On 4/17/2023 3:24 PM, Chao Gao wrote:
> On Tue, Apr 04, 2023 at 09:09:20PM +0800, Binbin Wu wrote:
>> /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> This is an open-coded GENMASK_ULL(). So, you'd better use GENMASK_ULL() here.

This is basically a code move and rename from PT_BASE_ADDR_MASK to
__PT_BASE_ADDR_MASK.
I didn't change the original code, but if it is preferred to use
GENMASK_ULL() in kernel/KVM, I can change it as follows:

#define __PT_BASE_ADDR_MASK GENMASK_ULL(51, 12)
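
For what it's worth (an editor's note, assuming 4K pages, i.e.
PAGE_SIZE == 4096 on x86), the two forms are equivalent, which could
even be asserted at build time:

	/* Both expand to bits 51:12 set, i.e. 0x000ffffffffff000. */
	static_assert(GENMASK_ULL(51, 12) ==
		      (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE - 1)));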


>
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>> 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>> 	 * the current vCPU mode is accurate.
>> 	 */
>> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>> +	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
> I prefer to modify the call sites in SVM nested code to use the new
> function. Although this change does not affect functionality, it
> provides a clear distinction between CR3 checks and GPA checks.

Makes sense, will do it.
Binbin Wu April 21, 2023, 6:35 a.m. UTC | #11
On 4/13/2023 5:13 PM, Huang, Kai wrote:
>> On 4/13/2023 10:27 AM, Huang, Kai wrote:
>>> On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>>>> On 4/12/2023 7:58 PM, Huang, Kai wrote:
>>>>
>> ...
>>>>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>>>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>>>>> way, below
>>>>> mmu_check_root() may potentially catch other invalid bits, but in
>>>>> practice there should be no difference I guess.
>>>> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>>>>
>>>> However, Sean pointed out that the return value of
>>>> mmu->get_guest_pgd(vcpu) could be
>>>> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>>> Yes, although EPTP's high bits don't contain any control bits.
>>>
>>> But perhaps we want to make it future-proof in case some more control
>>> bits are added to EPTP too.
>>>
>>>> Since the guest pgd has been check for valadity, for both CR3 and
>>>> EPTP, it is safe to mask off non-address bits to get GFN.
>>>>
>>>> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>>>> more undertandable.
>>> This isn't necessary, and can/should be done in comments if needed.
>>>
>>> But IMHO you may want to add more material to explain how nested cases
>>> are handled.
>> Do you mean about CR3 or others?
>>
> This patch is about CR3, so CR3.

For the nested case, I plan to add the following to the changelog:

     For nested guest:
     - If CR3 is intercepted, after CR3 is handled in L1, CR3 will be
       checked in nested_vmx_load_cr3() before returning back to L2.
     - For the shadow paging case (SPT02), LAM bits are also handled to
       form a new shadow CR3 in vmx_load_mmu_pgd().
Huang, Kai April 21, 2023, 11:43 a.m. UTC | #12
On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
> On 4/13/2023 5:13 PM, Huang, Kai wrote:
> > > On 4/13/2023 10:27 AM, Huang, Kai wrote:
> > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
> > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote:
> > > > > 
> > > ...
> > > > > > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
> > > > > > way, below
> > > > > > mmu_check_root() may potentially catch other invalid bits, but in
> > > > > > practice there should be no difference I guess.
> > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
> > > > > 
> > > > > However, Sean pointed out that the return value of
> > > > > mmu->get_guest_pgd(vcpu) could be
> > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
> > > > Yes, although EPTP's high bits don't contain any control bits.
> > > > 
> > > > But perhaps we want to make it future-proof in case some more control
> > > > bits are added to EPTP too.
> > > > 
> > > > > Since the guest pgd has been check for valadity, for both CR3 and
> > > > > EPTP, it is safe to mask off non-address bits to get GFN.
> > > > > 
> > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it
> > > > > more undertandable.
> > > > This isn't necessary, and can/should be done in comments if needed.
> > > > 
> > > > But IMHO you may want to add more material to explain how nested cases
> > > > are handled.
> > > Do you mean about CR3 or others?
> > > 
> > This patch is about CR3, so CR3.
> 
> For nested case, I plan to add the following in the changelog:
> 
>      For nested guest:
>      - If CR3 is intercepted, after CR3 handled in L1, CR3 will be 
> checked in
>        nested_vmx_load_cr3() before returning back to L2.
>      - For the shadow paging case (SPT02), LAM bits are also be handled 
> to form
>        a new shadow CR3 in vmx_load_mmu_pgd().
> 
> 

I don't know a lot of the KVM nested code in detail, but in general, since your
code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
changelog should focus on explaining why modifying these two functions is good
enough.

And to explain this, I think we should explain from hardware's perspective
rather than from shadow paging's perspective.

From L0's perspective, we care about:

	1) L1's CR3 register (VMCS01's GUEST_CR3)
	2) L1's VMCS to run L2 guest
		2.1) VMCS12's HOST_CR3 
		2.2) VMCS12's GUEST_CR3

For 1) the current changelog has explained (that we need to catch _active_
control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
intercept CR3 or not.  But this isn't the point because from hardware's point of
view we actually care about below two cases instead:

	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01 
	   to reflect VMCS12
	2) L0 to VMENTER to L2 using VMCS02 directly

Case 2) can of course fail to VMENTER to L2, but this should result in L0
VMENTERing to L1 with an emulated VMEXIT from L2 to L1, which is case 1).

For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
CR3 feature control bits.  The key code path is:

	vmx_handle_exit()
		-> prepare_vmcs12()
		-> load_vmcs12_host_state().  

For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
GUEST_CR3 against active control bits.  The key code path is 

	nested_vmx_run() -> 
		-> nested_vmx_check_host_state()
		-> nested_vmx_enter_non_root_mode()
			-> prepare_vmcs02_early()
			-> prepare_vmcs02()

Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
active control bits seems just enough.

Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
return early if any HOST state is wrong) currently checks
kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.

That being said, I do find it's not easy to come up with a "concise" changelog
to cover both non-nested and (especially) nested cases, but it seems we can
abstract some from my above typing?  

My suggestion is to focus on the hardware behaviour's perspective as typed
above.  But I believe Sean can easily make a "concise" changelog if he wants to
comment here :)
Chao Gao April 21, 2023, 3:32 p.m. UTC | #13
On Fri, Apr 21, 2023 at 07:43:52PM +0800, Huang, Kai wrote:
>On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
>> On 4/13/2023 5:13 PM, Huang, Kai wrote:
>> > > On 4/13/2023 10:27 AM, Huang, Kai wrote:
>> > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>> > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote:
>> > > > > 
>> > > ...
>> > > > > > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>> > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>> > > > > > way, below
>> > > > > > mmu_check_root() may potentially catch other invalid bits, but in
>> > > > > > practice there should be no difference I guess.
>> > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>> > > > > 
>> > > > > However, Sean pointed out that the return value of
>> > > > > mmu->get_guest_pgd(vcpu) could be
>> > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>> > > > Yes, although EPTP's high bits don't contain any control bits.
>> > > > 
>> > > > But perhaps we want to make it future-proof in case some more control
>> > > > bits are added to EPTP too.
>> > > > 
>> > > > > Since the guest pgd has been check for valadity, for both CR3 and
>> > > > > EPTP, it is safe to mask off non-address bits to get GFN.
>> > > > > 
>> > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>> > > > > more undertandable.
>> > > > This isn't necessary, and can/should be done in comments if needed.
>> > > > 
>> > > > But IMHO you may want to add more material to explain how nested cases
>> > > > are handled.
>> > > Do you mean about CR3 or others?
>> > > 
>> > This patch is about CR3, so CR3.
>> 
>> For nested case, I plan to add the following in the changelog:
>> 
>>      For nested guest:
>>      - If CR3 is intercepted, after CR3 handled in L1, CR3 will be 
>> checked in
>>        nested_vmx_load_cr3() before returning back to L2.
>>      - For the shadow paging case (SPT02), LAM bits are also be handled 
>> to form
>>        a new shadow CR3 in vmx_load_mmu_pgd().
>> 
>> 
>
>I don't know a lot of code detail of KVM nested code, but in general, since your
>code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
>changelog should focus on explaining why modifying these two functions are good
>enough.

The patch relaxes all existing checks on CR3, allowing previously reserved bits
to be set. That should be enough; otherwise, some checks on CR3 were missing in
the first place.

>
>And to explain this, I think we should explain from hardware's perspective
>rather than from shadow paging's perspective.
>
>From L0's perspective, we care about:
>
>	1) L1's CR3 register (VMCS01's GUEST_CR3)
>	2) L1's VMCS to run L2 guest
>		2.1) VMCS12's HOST_CR3 
>		2.2) VMCS12's GUEST_CR3
>
>For 1) the current changelog has explained (that we need to catch _active_
>control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
>intercept CR3 or not.  But this isn't the point because from hardware's point of
>view we actually care about below two cases instead:
>
>	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01 
>	   to reflect VMCS12
>	2) L0 to VMENTER to L2 using VMCS02 directly
>
>The case 2) can fail due to fail to VMENTER to L2 of course but this should
>result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
>case 1).
>
>For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
>CR3 feature control bits.  The key code path is:
>
>	vmx_handle_exit()
>		-> prepare_vmcs12()
>		-> load_vmcs12_host_state().  
>
>For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
>GUEST_CR3 against active control bits.  The key code path is 

...

>
>	nested_vmx_run() -> 
>		-> nested_vmx_check_host_state()
>		-> nested_vmx_enter_non_root_mode()
>			-> prepare_vmcs02_early()
>			-> prepare_vmcs02()
>
>Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
>(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
>kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
>active control bits seems just enough.

IMO, current KVM relies on hardware to do consistency checks on GUEST_CR3
during VM entry.

>
>Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
>return early if any HOST state is wrong) currently checks
>kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
>
>That being said, I do find it's not easy to come up with a "concise" changelog
>to cover both non-nested and (especially) nested cases, but it seems we can
>abstract some from my above typing?  
>
>My suggestion is to focus on the hardware behaviour's perspective as typed
>above.  But I believe Sean can easily make a "concise" changelog if he wants to
>comment here :)
Binbin Wu April 22, 2023, 3:32 a.m. UTC | #14
Kai,

Thanks for your inputs.

I rephrased the changelog as follows:


LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
masking for user mode pointers.

To support LAM in KVM, CR3 validity checks and shadow paging handling need to
be modified accordingly.

== CR3 validity check ==
When LAM is supported, CR3 LAM bits are allowed to be set, and the check of CR3
needs to be modified.
Add a helper kvm_vcpu_is_legal_cr3() and use it instead of kvm_vcpu_is_legal_gpa()
in all existing CR3 checks, as follows:
When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
Non-nested case:
- When EPT is on, CR3 is fully under control of the guest.
- When EPT is off, CR3 is intercepted and checked in kvm_set_cr3() during
   CR3 VMExit handling.
Nested case, from L0's perspective, we care about:
- L1's CR3 register (VMCS01's GUEST_CR3); it's the same as the non-nested case.
- L1's VMCS to run an L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3).
   Two related paths:
   1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
          nested_vmx_vmexit()
          -> load_vmcs12_host_state()
                -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3
   2. L0 to VMENTER to L2 using VMCS02
          nested_vmx_run()
          -> nested_vmx_check_host_state()   //check VMCS12's HOST_CR3
          -> nested_vmx_enter_non_root_mode()
             -> prepare_vmcs02()
                -> nested_vmx_load_cr3()     //check VMCS12's GUEST_CR3
   Path 2 can of course fail to VMENTER to L2, but later this should result in
   path 1.

== Shadow paging handling ==
When EPT is off, the following changes are needed to handle shadow paging:
- LAM bits should be stripped when computing the GFN from the guest PGD when it
   is CR3 (for the nested EPT case, the guest PGD is the nested EPTP).
   To be generic, extract the maximal base address mask from the guest PGD since
   its validity has already been checked.
- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
   It could potentially increase root cache misses and MMU reloads, however,
   it's very rare to turn off EPT when performance matters.
- Active CR3 LAM bits should be kept when forming the shadow CR3.

To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
the bits used to control supported features related to CR3 (e.g. LAM).
- Supported control bits are set in the field after guest CPUID is set.
- The field is used by
   kvm_vcpu_is_legal_cr3() to check CR3, and by
   kvm_get_active_cr3_ctrl_bits() to extract the active control bits of CR3.
   It also serves as a quick check for LAM feature support.
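
For reference, a minimal sketch of the two helpers mentioned above; they mirror
the cpuid.h/mmu.h hunks in the patch at the end of this mail:

static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
	/* Carve the allowed control bits (e.g. LAM_U48/U57) out of the GPA check. */
	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
}

static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
{
	/* Control bits currently set in the guest's CR3; used to build the shadow CR3. */
	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
}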

On 4/21/2023 7:43 PM, Huang, Kai wrote:
> On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
>> On 4/13/2023 5:13 PM, Huang, Kai wrote:
>>>> On 4/13/2023 10:27 AM, Huang, Kai wrote:
>>>>> On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>>>>>> On 4/12/2023 7:58 PM, Huang, Kai wrote:
>>>>>>
>>>> ...
>>>>>>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>>>>>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>>>>>>> way, below
>>>>>>> mmu_check_root() may potentially catch other invalid bits, but in
>>>>>>> practice there should be no difference I guess.
>>>>>> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>>>>>>
>>>>>> However, Sean pointed out that the return value of
>>>>>> mmu->get_guest_pgd(vcpu) could be
>>>>>> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>>>>> Yes, although EPTP's high bits don't contain any control bits.
>>>>>
>>>>> But perhaps we want to make it future-proof in case some more control
>>>>> bits are added to EPTP too.
>>>>>
>>>>>> Since the guest pgd has been check for valadity, for both CR3 and
>>>>>> EPTP, it is safe to mask off non-address bits to get GFN.
>>>>>>
>>>>>> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>>>>>> more undertandable.
>>>>> This isn't necessary, and can/should be done in comments if needed.
>>>>>
>>>>> But IMHO you may want to add more material to explain how nested cases
>>>>> are handled.
>>>> Do you mean about CR3 or others?
>>>>
>>> This patch is about CR3, so CR3.
>> For nested case, I plan to add the following in the changelog:
>>
>>       For nested guest:
>>       - If CR3 is intercepted, after CR3 handled in L1, CR3 will be
>> checked in
>>         nested_vmx_load_cr3() before returning back to L2.
>>       - For the shadow paging case (SPT02), LAM bits are also be handled
>> to form
>>         a new shadow CR3 in vmx_load_mmu_pgd().
>>
>>
> I don't know a lot of code detail of KVM nested code, but in general, since your
> code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
> changelog should focus on explaining why modifying these two functions are good
> enough.
>
> And to explain this, I think we should explain from hardware's perspective
> rather than from shadow paging's perspective.
>
>  From L0's perspective, we care about:
>
> 	1) L1's CR3 register (VMCS01's GUEST_CR3)
> 	2) L1's VMCS to run L2 guest
> 		2.1) VMCS12's HOST_CR3
> 		2.2) VMCS12's GUEST_CR3
>
> For 1) the current changelog has explained (that we need to catch _active_
> control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
> intercept CR3 or not.  But this isn't the point because from hardware's point of
> view we actually care about below two cases instead:
>
> 	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01
> 	   to reflect VMCS12
> 	2) L0 to VMENTER to L2 using VMCS02 directly
>
> The case 2) can fail due to fail to VMENTER to L2 of course but this should
> result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
> case 1).
>
> For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
> CR3 feature control bits.  The key code path is:
>
> 	vmx_handle_exit()
> 		-> prepare_vmcs12()
> 		-> load_vmcs12_host_state().
>
> For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
> GUEST_CR3 against active control bits.  The key code path is
>
> 	nested_vmx_run() ->
> 		-> nested_vmx_check_host_state()
> 		-> nested_vmx_enter_non_root_mode()
> 			-> prepare_vmcs02_early()
> 			-> prepare_vmcs02()
>
> Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
> (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
> kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
> active control bits seems just enough.
>
> Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
> return early if any HOST state is wrong) currently checks
> kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
>
> That being said, I do find it's not easy to come up with a "concise" changelog
> to cover both non-nested and (especially) nested cases, but it seems we can
> abstract some from my above typing?
>
> My suggestion is to focus on the hardware behaviour's perspective as typed
> above.  But I believe Sean can easily make a "concise" changelog if he wants to
> comment here :)
Chao Gao April 22, 2023, 4:43 a.m. UTC | #15
On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
>Kai,
>
>Thanks for your inputs.
>
>I rephrased the changelog as following:
>
>
>LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
>masking for user mode pointers.
>
>To support LAM in KVM, CR3 validity checks and shadow paging handling need to
>be
>modified accordingly.
>
>== CR3 validity Check ==
>When LAM is supported, CR3 LAM bits are allowed to be set and the check of
>CR3
>needs to be modified.

It is better to describe the hardware change here:

On processors that enumerate support for LAM, CR3 register allows
LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both
GUEST_CR3 and HOST_CR3 fields.

To emulate LAM hardware behavior, KVM needs to
1. allow LAM_U48/U57 to be set in the CR3 register by the guest or userspace
2. allow LAM_U48/U57 to be set in the GUEST_CR3/HOST_CR3 fields in vmcs12
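
(A sketch of the two bits above, with bit numbers taken from the LAM description
in this thread; the real macros are presumably defined by an earlier patch in the
series:)

#define X86_CR3_LAM_U57_BIT	61	/* activate LAM for user pointers, 57-bit masking */
#define X86_CR3_LAM_U57		_BITULL(X86_CR3_LAM_U57_BIT)
#define X86_CR3_LAM_U48_BIT	62	/* activate LAM for user pointers, 48-bit masking */
#define X86_CR3_LAM_U48		_BITULL(X86_CR3_LAM_U48_BIT)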

>Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
>kvm_vcpu_is_legal_gpa()
>to do the new CR3 checks in all existing CR3 checks as following:
>When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
>Non-nested case
>- When EPT on, CR3 is fully under control of guest.
>- When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
>  CR3 VMExit handling.
>Nested case, from L0's perspective, we care about:
>- L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
>- L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
>  Two paths related:
>  1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
>         nested_vm_exit()
>         -> load_vmcs12_host_state()
>               -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3

This is just a byproduct of using a unified function, i.e.,
nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.

LAM spec says:

VM entry checks the values of the CR3 and CR4 fields in the guest-area
and host-state area of the VMCS. In particular, the bits in these fields
that correspond to bits reserved in the corresponding register are
checked and must be 0.

It doesn't mention any check on VM exit. So, it looks to me that VM exit
doesn't do consistency checks. Then, I think there is no need to call
out this path.

>  2. L0 to VMENTER to L2 using VMCS02
>         nested_vmx_run()
>         -> nested_vmx_check_host_state()   //check VMCS12's HOST_CR3
>         -> nested_vmx_enter_non_root_mode()
>            -> prepare_vmcs02()
>               -> nested_vmx_load_cr3()     //check VMCS12's GUEST_CR3
>  Path 2 can fail to VMENTER to L2 of course, but later this should result in
>  path 1.
>
>== Shadow paging handling ==
>When EPT is off, the following changes needed to handle shadow paging:
>- LAM bits should be stripped to perform GFN calculation from guest PGD when
>it
>  is CR3 (for nested EPT case, guest PGD is nested EPTP).
>  To be generic, extract the maximal base address mask from guest PGD since
>the
>  validity of guest PGD has been checked already.
>- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>  It could potentially increase root cache misses and MMU reload, however,
>  it's very rare to turn off EPT when performance matters.
>- Active CR3 LAM bits should be kept to form a shadow CR3.
>
>To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
>the bits used to control supported features related to CR3 (e.g. LAM).
>- Supported control bits are set to the field after set cpuid.
>- the field is used in
>  kvm_vcpu_is_legal_cr3() to check CR3.
>  kvm_get_active_cr3_ctrl_bits() to extract active control bits of CR3.
>  Also as a quick check for LAM feature support.
>
>On 4/21/2023 7:43 PM, Huang, Kai wrote:
>> On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
>> > On 4/13/2023 5:13 PM, Huang, Kai wrote:
>> > > > On 4/13/2023 10:27 AM, Huang, Kai wrote:
>> > > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>> > > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote:
>> > > > > > 
>> > > > ...
>> > > > > > > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>> > > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>> > > > > > > way, below
>> > > > > > > mmu_check_root() may potentially catch other invalid bits, but in
>> > > > > > > practice there should be no difference I guess.
>> > > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>> > > > > > 
>> > > > > > However, Sean pointed out that the return value of
>> > > > > > mmu->get_guest_pgd(vcpu) could be
>> > > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>> > > > > Yes, although EPTP's high bits don't contain any control bits.
>> > > > > 
>> > > > > But perhaps we want to make it future-proof in case some more control
>> > > > > bits are added to EPTP too.
>> > > > > 
>> > > > > > Since the guest pgd has been check for valadity, for both CR3 and
>> > > > > > EPTP, it is safe to mask off non-address bits to get GFN.
>> > > > > > 
>> > > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>> > > > > > more undertandable.
>> > > > > This isn't necessary, and can/should be done in comments if needed.
>> > > > > 
>> > > > > But IMHO you may want to add more material to explain how nested cases
>> > > > > are handled.
>> > > > Do you mean about CR3 or others?
>> > > > 
>> > > This patch is about CR3, so CR3.
>> > For nested case, I plan to add the following in the changelog:
>> > 
>> >       For nested guest:
>> >       - If CR3 is intercepted, after CR3 handled in L1, CR3 will be
>> > checked in
>> >         nested_vmx_load_cr3() before returning back to L2.
>> >       - For the shadow paging case (SPT02), LAM bits are also be handled
>> > to form
>> >         a new shadow CR3 in vmx_load_mmu_pgd().
>> > 
>> > 
>> I don't know a lot of code detail of KVM nested code, but in general, since your
>> code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
>> changelog should focus on explaining why modifying these two functions are good
>> enough.
>> 
>> And to explain this, I think we should explain from hardware's perspective
>> rather than from shadow paging's perspective.
>> 
>>  From L0's perspective, we care about:
>> 
>> 	1) L1's CR3 register (VMCS01's GUEST_CR3)
>> 	2) L1's VMCS to run L2 guest
>> 		2.1) VMCS12's HOST_CR3
>> 		2.2) VMCS12's GUEST_CR3
>> 
>> For 1) the current changelog has explained (that we need to catch _active_
>> control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
>> intercept CR3 or not.  But this isn't the point because from hardware's point of
>> view we actually care about below two cases instead:
>> 
>> 	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01
>> 	   to reflect VMCS12
>> 	2) L0 to VMENTER to L2 using VMCS02 directly
>> 
>> The case 2) can fail due to fail to VMENTER to L2 of course but this should
>> result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
>> case 1).
>> 
>> For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
>> CR3 feature control bits.  The key code path is:
>> 
>> 	vmx_handle_exit()
>> 		-> prepare_vmcs12()
>> 		-> load_vmcs12_host_state().
>> 
>> For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
>> GUEST_CR3 against active control bits.  The key code path is
>> 
>> 	nested_vmx_run() ->
>> 		-> nested_vmx_check_host_state()
>> 		-> nested_vmx_enter_non_root_mode()
>> 			-> prepare_vmcs02_early()
>> 			-> prepare_vmcs02()
>> 
>> Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
>> (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
>> kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
>> active control bits seems just enough.
>> 
>> Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
>> return early if any HOST state is wrong) currently checks
>> kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
>> 
>> That being said, I do find it's not easy to come up with a "concise" changelog
>> to cover both non-nested and (especially) nested cases, but it seems we can
>> abstract some from my above typing?
>> 
>> My suggestion is to focus on the hardware behaviour's perspective as typed
>> above.  But I believe Sean can easily make a "concise" changelog if he wants to
>> comment here :)
Chao Gao April 22, 2023, 4:51 a.m. UTC | #16
On Fri, Apr 21, 2023 at 11:32:17PM +0800, Chao Gao wrote:
>>For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
>>GUEST_CR3 against active control bits.  The key code path is 
>
>...
>
>>
>>	nested_vmx_run() -> 
>>		-> nested_vmx_check_host_state()
>>		-> nested_vmx_enter_non_root_mode()
>>			-> prepare_vmcs02_early()
>>			-> prepare_vmcs02()
>>
>>Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
>>(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
>>kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
>>active control bits seems just enough.
>
>IMO, current KVM relies on hardware to do consistency check on the GUEST_CR3
>during VM entry.

Please disregard my remark on the consistency check on GUEST_CR3. I just took
a closer look at the code. It is indeed done by KVM in nested_vmx_load_cr3().
Sorry for the noise.
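
(For reference, with this patch the common check in nested_vmx_load_cr3(), which
covers both VMCS12's GUEST_CR3 on VM entry to L2 and VMCS12's HOST_CR3 on the
emulated VM exit to L1, ends up looking roughly like this, mirroring the nested.c
hunk at the end of this mail:)

	/* Reject a CR3 that is neither a legal GPA nor allowed control bits. */
	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
		*entry_failure_code = ENTRY_FAIL_DEFAULT;
		return -EINVAL;
	}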
Huang, Kai April 22, 2023, 8:14 a.m. UTC | #17
On Sat, 2023-04-22 at 12:51 +0800, Chao Gao wrote:
> On Fri, Apr 21, 2023 at 11:32:17PM +0800, Chao Gao wrote:
> > > For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
> > > GUEST_CR3 against active control bits.  The key code path is 
> > 
> > ...
> > 
> > > 
> > > 	nested_vmx_run() -> 
> > > 		-> nested_vmx_check_host_state()
> > > 		-> nested_vmx_enter_non_root_mode()
> > > 			-> prepare_vmcs02_early()
> > > 			-> prepare_vmcs02()
> > > 
> > > Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
> > > (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
> > > kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
> > > active control bits seems just enough.
> > 
> > IMO, current KVM relies on hardware to do consistency check on the GUEST_CR3
> > during VM entry.
> 
> Please disregard my remark on the consistency check on GUEST_CR3. I just took
> a closer look at the code. It is indeed done by KVM in nested_vmx_load_cr3().
> Sorry for the noise.

Right.  You cannot simply rely on hardware to do the CR3 check at VMENTER,
because at least there's the nasty case that L1's MAXPHYSADDR can be smaller
than L0's.
Huang, Kai April 25, 2023, 10:48 p.m. UTC | #18
On Sat, 2023-04-22 at 11:32 +0800, Binbin Wu wrote:
> Kai,
> 
> Thanks for your inputs.
> 
> I rephrased the changelog as following:

To me it seems there are too many details and they should be trimmed down.  But
putting the changelog aside, ...

> 
> 
> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
> masking for user mode pointers.
> 
> To support LAM in KVM, CR3 validity checks and shadow paging handling 
> need to be
> modified accordingly.
> 
> == CR3 validity Check ==
> When LAM is supported, CR3 LAM bits are allowed to be set and the check 
> of CR3
> needs to be modified.
> Add a helper kvm_vcpu_is_legal_cr3() and use it instead of 
> kvm_vcpu_is_legal_gpa()
> to do the new CR3 checks in all existing CR3 checks as following:
> When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
> Non-nested case
> - When EPT on, CR3 is fully under control of guest.
> - When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() 
> during
>    CR3 VMExit handling.

... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
KVM.

Is there any global enabling bit in any control register to turn on/off LAM
globally?  It seems there isn't, because AFAICT the bits in CR4 are used to
control LAM for supervisor mode linear addresses, not LAM globally?

So if it is true, then it appears hardware depends on CPUID purely to decide
whether to perform LAM or not.

Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on
hardware that supports LAM, I think the guest can still enable LAM in CR3 w/o
causing any trouble (because the hardware actually supports this feature)?

If it's true, it seems we should trap CR3 (at least loading) when the hardware
supports LAM but it's not exposed to the guest, so that KVM can correctly reject
any LAM control bits when the guest illegally sets them?
Chao Gao April 26, 2023, 3:05 a.m. UTC | #19
On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>KVM.
>
>Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>seems there isn't because AFAICT the bits in CR4 are used to control super mode
>linear address but not LAM in global?

Right.

>
>So if it is true, then it appears hardware depends on CPUID purely to decide
>whether to perform LAM or not.
>
>Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>causing any trouble (because the hardware actually supports this feature)?

Yes. But I think it is a non-issue ...

>
>If it's true, it seems we should trap CR3 (at least loading) when hardware
>supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>any LAM control bits when guest illegally does so?
>

Other features which need no explicit enablement (like AVX and other
new instructions) have the same problem.

The impact is that some guests can use features which they are not supposed
to use. Then they might be broken after migration or by KVM's instruction
emulation. But they put themselves at stake; KVM shouldn't be blamed.

The downside of intercepting CR3 is the performance impact on existing
VMs (all with old CPU models and thus none with LAM). If they are
migrated to LAM-capable parts in the future, they will suffer a
performance drop even though they are good tenants (i.e., won't try to
use LAM).

IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
when EPT=on cannot outweigh the performance impact. So, I vote to
document in the changelog or comments that:
A guest can enable LAM for userspace pointers when EPT=on even if LAM
isn't exposed to it. KVM doesn't prevent this out of performance
considerations.
Binbin Wu April 26, 2023, 5:13 a.m. UTC | #20
On 4/26/2023 11:05 AM, Chao Gao wrote:
> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>> ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>> KVM.
>>
>> Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>> seems there isn't because AFAICT the bits in CR4 are used to control super mode
>> linear address but not LAM in global?
> Right.
>
>> So if it is true, then it appears hardware depends on CPUID purely to decide
>> whether to perform LAM or not.
>>
>> Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>> hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>> causing any trouble (because the hardware actually supports this feature)?
> Yes. But I think it is a non-issue ...
>
>> If it's true, it seems we should trap CR3 (at least loading) when hardware
>> supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>> any LAM control bits when guest illegally does so?
>>
> Other features which need no explicit enablement (like AVX and other
> new instructions) have the same problem.
>
> The impact is some guests can use features which they are not supposed
> to use. Then they might be broken after migration or kvm's instruction
> emulation. But they put themselves at stake, KVM shouldn't be blamed.
Agree.

>
> The downside of intercepting CR3 is the performance impact on existing
> VMs (all with old CPU models and thus all have no LAM). If they are
> migrated to LAM-capable parts in the future, they will suffer
> performance drop even though they are good tenents (i.e., won't try to
> use LAM).
>
> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> when EPT=on cannot outweigh the performance impact. So, I vote to
> document in changelog or comments that:
> A guest can enable LAM for userspace pointers when EPT=on even if LAM
> isn't exposed to it. KVM doens't prevent this out of performance
> consideration

How about adding the comment to the code:

+       /*
+        * A guest can enable LAM for userspace pointers when EPT=on on a
+        * processor supporting LAM even if LAM isn't exposed to it.
+        * KVM doesn't prevent this out of performance considerations.
+        */
         if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
                 vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
Huang, Kai April 26, 2023, 8:43 a.m. UTC | #21
On Wed, 2023-04-26 at 11:05 +0800, Gao, Chao wrote:
> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
> > ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
> > KVM.
> > 
> > Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
> > seems there isn't because AFAICT the bits in CR4 are used to control super mode
> > linear address but not LAM in global?
> 
> Right.
> 
> > 
> > So if it is true, then it appears hardware depends on CPUID purely to decide
> > whether to perform LAM or not.
> > 
> > Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
> > hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
> > causing any trouble (because the hardware actually supports this feature)?
> 
> Yes. But I think it is a non-issue ...
> 
> > 
> > If it's true, it seems we should trap CR3 (at least loading) when hardware
> > supports LAM but it's not exposed to the guest, so that KVM can correctly reject
> > any LAM control bits when guest illegally does so?
> > 
> 
> Other features which need no explicit enablement (like AVX and other
> new instructions) have the same problem.

OK.

> 
> The impact is some guests can use features which they are not supposed
> to use. Then they might be broken after migration or kvm's instruction
> emulation. But they put themselves at stake, KVM shouldn't be blamed.
> 
> The downside of intercepting CR3 is the performance impact on existing
> VMs (all with old CPU models and thus all have no LAM). If they are
> migrated to LAM-capable parts in the future, they will suffer
> performance drop even though they are good tenents (i.e., won't try to
> use LAM).
> 
> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> when EPT=on cannot outweigh the performance impact. So, I vote to
> document in changelog or comments that:
> A guest can enable LAM for userspace pointers when EPT=on even if LAM
> isn't exposed to it. KVM doens't prevent this out of performance
> consideration

Yeah, the performance impact is the concern.  I agree we can just call this out in
the changelog and/or comments.  Just want to make sure this is mentioned/discussed.

My main concern is, as (any) VMEXIT saves the guest's CR3 to VMCS's GUEST_CR3, KVM
may see GUEST_CR3 containing invalid control bits (because KVM believes the
guest doesn't support those feature bits), and if KVM code carelessly uses
WARN() around that code, a malicious guest may be able to attack the host, which
means we need to pay more attention to code review around GUEST_CR3 in the
future.

Anyway, not intercepting CR3 is fine to me, and I will leave this to others.
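
(To illustrate the concern, a hypothetical example of the pattern to avoid; the
symbols are existing KVM/VMX ones, but the snippet itself is not part of this
patch:)

	unsigned long guest_cr3 = vmcs_readl(GUEST_CR3);

	/*
	 * Bad: guest-triggerable when EPT is on and the guest sets the LAM
	 * bits despite LAM not being exposed to it, so this must not be
	 * treated as a KVM invariant.
	 */
	WARN_ON_ONCE((guest_cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57)) &&
		     !guest_cpuid_has(vcpu, X86_FEATURE_LAM));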
Huang, Kai April 26, 2023, 8:44 a.m. UTC | #22
On Wed, 2023-04-26 at 13:13 +0800, Binbin Wu wrote:
> 
> On 4/26/2023 11:05 AM, Chao Gao wrote:
> > On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
> > > ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
> > > KVM.
> > > 
> > > Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
> > > seems there isn't because AFAICT the bits in CR4 are used to control super mode
> > > linear address but not LAM in global?
> > Right.
> > 
> > > So if it is true, then it appears hardware depends on CPUID purely to decide
> > > whether to perform LAM or not.
> > > 
> > > Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
> > > hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
> > > causing any trouble (because the hardware actually supports this feature)?
> > Yes. But I think it is a non-issue ...
> > 
> > > If it's true, it seems we should trap CR3 (at least loading) when hardware
> > > supports LAM but it's not exposed to the guest, so that KVM can correctly reject
> > > any LAM control bits when guest illegally does so?
> > > 
> > Other features which need no explicit enablement (like AVX and other
> > new instructions) have the same problem.
> > 
> > The impact is some guests can use features which they are not supposed
> > to use. Then they might be broken after migration or kvm's instruction
> > emulation. But they put themselves at stake, KVM shouldn't be blamed.
> Agree.
> 
> > 
> > The downside of intercepting CR3 is the performance impact on existing
> > VMs (all with old CPU models and thus all have no LAM). If they are
> > migrated to LAM-capable parts in the future, they will suffer
> > performance drop even though they are good tenents (i.e., won't try to
> > use LAM).
> > 
> > IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> > when EPT=on cannot outweigh the performance impact. So, I vote to
> > document in changelog or comments that:
> > A guest can enable LAM for userspace pointers when EPT=on even if LAM
> > isn't exposed to it. KVM doens't prevent this out of performance
> > consideration
> 
> How about add the comments on the code:
> 
> +       /*
> +        * A guest can enable LAM for userspace pointers when EPT=on on a
> +        * processor supporting LAM even if LAM isn't exposed to it.
> +        * KVM doesn't prevent this out of performance considerations.
> +        */
>          if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
>                  vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | 
> X86_CR3_LAM_U57;
> 
> 

I would say we should at least call it out in the changelog, but I don't have an
opinion on whether to have this comment around the code or not.
Binbin Wu April 26, 2023, 8:50 a.m. UTC | #23
On 4/26/2023 4:44 PM, Huang, Kai wrote:
> On Wed, 2023-04-26 at 13:13 +0800, Binbin Wu wrote:
>> On 4/26/2023 11:05 AM, Chao Gao wrote:
>>> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>>>> ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>>>> KVM.
>>>>
>>>> Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>>>> seems there isn't because AFAICT the bits in CR4 are used to control super mode
>>>> linear address but not LAM in global?
>>> Right.
>>>
>>>> So if it is true, then it appears hardware depends on CPUID purely to decide
>>>> whether to perform LAM or not.
>>>>
>>>> Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>>>> hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>>>> causing any trouble (because the hardware actually supports this feature)?
>>> Yes. But I think it is a non-issue ...
>>>
>>>> If it's true, it seems we should trap CR3 (at least loading) when hardware
>>>> supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>>>> any LAM control bits when guest illegally does so?
>>>>
>>> Other features which need no explicit enablement (like AVX and other
>>> new instructions) have the same problem.
>>>
>>> The impact is some guests can use features which they are not supposed
>>> to use. Then they might be broken after migration or kvm's instruction
>>> emulation. But they put themselves at stake, KVM shouldn't be blamed.
>> Agree.
>>
>>> The downside of intercepting CR3 is the performance impact on existing
>>> VMs (all with old CPU models and thus all have no LAM). If they are
>>> migrated to LAM-capable parts in the future, they will suffer
>>> performance drop even though they are good tenents (i.e., won't try to
>>> use LAM).
>>>
>>> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
>>> when EPT=on cannot outweigh the performance impact. So, I vote to
>>> document in changelog or comments that:
>>> A guest can enable LAM for userspace pointers when EPT=on even if LAM
>>> isn't exposed to it. KVM doens't prevent this out of performance
>>> consideration
>> How about add the comments on the code:
>>
>> +       /*
>> +        * A guest can enable LAM for userspace pointers when EPT=on on a
>> +        * processor supporting LAM even if LAM isn't exposed to it.
>> +        * KVM doesn't prevent this out of performance considerations.
>> +        */
>>           if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
>>                   vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 |
>> X86_CR3_LAM_U57;
>>
>>
> I would say we should at least call out in the changelog, but I don't have
> opinion on whether to have this comment around this code or not.
OK, will also add it to the changelog.
Binbin Wu April 26, 2023, 10:52 a.m. UTC | #24
On 4/26/2023 4:43 PM, Huang, Kai wrote:
> On Wed, 2023-04-26 at 11:05 +0800, Gao, Chao wrote:
>> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>>> ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>>> KVM.
>>>
>>> Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>>> seems there isn't because AFAICT the bits in CR4 are used to control super mode
>>> linear address but not LAM in global?
>> Right.
>>
>>> So if it is true, then it appears hardware depends on CPUID purely to decide
>>> whether to perform LAM or not.
>>>
>>> Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>>> hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>>> causing any trouble (because the hardware actually supports this feature)?
>> Yes. But I think it is a non-issue ...
>>
>>> If it's true, it seems we should trap CR3 (at least loading) when hardware
>>> supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>>> any LAM control bits when guest illegally does so?
>>>
>> Other features which need no explicit enablement (like AVX and other
>> new instructions) have the same problem.
> OK.
>
>> The impact is some guests can use features which they are not supposed
>> to use. Then they might be broken after migration or kvm's instruction
>> emulation. But they put themselves at stake, KVM shouldn't be blamed.
>>
>> The downside of intercepting CR3 is the performance impact on existing
>> VMs (all with old CPU models and thus all have no LAM). If they are
>> migrated to LAM-capable parts in the future, they will suffer
>> performance drop even though they are good tenents (i.e., won't try to
>> use LAM).
>>
>> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
>> when EPT=on cannot outweigh the performance impact. So, I vote to
>> document in changelog or comments that:
>> A guest can enable LAM for userspace pointers when EPT=on even if LAM
>> isn't exposed to it. KVM doens't prevent this out of performance
>> consideration
> Yeah performance impact is the concern.  I agree we can just call out this in
> changelog and/or comments.  Just want to make sure this is mentioned/discussed.
>
> My main concern is, as (any) VMEXIT saves guest's CR3 to VMCS's GUEST_CR3, KVM
> may see GUEST_CR3 containing invalid control bits (because KVM believes the
> guest doesn't support those feature bits), and if KVM code carelessly uses
> WARN() around those code, a malicious guest may be able to attack host, which
> means we need to pay more attention to code review around GUEST_CR3 in the
> future.

How about doing the guest CR3 LAM bits check based on the capability of the
physical processor instead of the vCPU?
That is, KVM doesn't prevent the guest from setting CR3 LAM bits if the
processor supports LAM.
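
Something like the below in vmx_vcpu_after_set_cpuid() (a hypothetical sketch of
that alternative, not what this patch does):

	/*
	 * Hypothetical: allow the LAM bits whenever KVM itself supports LAM,
	 * regardless of whether LAM is exposed to this particular vCPU.
	 */
	if (kvm_cpu_cap_has(X86_FEATURE_LAM))
		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;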
>
> Anyway not intercepting CR3 is fine to me, and will leave this to others.
>
Huang, Kai April 27, 2023, 1:19 p.m. UTC | #25
On Sat, 2023-04-22 at 12:43 +0800, Gao, Chao wrote:
> On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
> > Kai,
> > 
> > Thanks for your inputs.
> > 
> > I rephrased the changelog as following:
> > 
> > 
> > LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
> > masking for user mode pointers.
> > 
> > To support LAM in KVM, CR3 validity checks and shadow paging handling need to
> > be
> > modified accordingly.
> > 
> > == CR3 validity Check ==
> > When LAM is supported, CR3 LAM bits are allowed to be set and the check of
> > CR3
> > needs to be modified.
> 
> it is better to describe the hardware change here:
> 
> On processors that enumerate support for LAM, CR3 register allows
> LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both
> GUEST_CR3 and HOST_CR3 fields.
> 
> To emulate LAM hardware behavior, KVM needs to
> 1. allow LAM_U48/U57 to be set to the CR3 register by guest or userspace
> 2. allow LAM_U48/U57 to be set to the GUES_CR3/HOST_CR3 fields in vmcs12

Agreed.  This is clearer.

> 
> > Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
> > kvm_vcpu_is_legal_gpa()
> > to do the new CR3 checks in all existing CR3 checks as following:
> > When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
> > Non-nested case
> > - When EPT on, CR3 is fully under control of guest.
> > - When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
> >   CR3 VMExit handling.
> > Nested case, from L0's perspective, we care about:
> > - L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
> > - L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
> >   Two paths related:
> >   1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
> >          nested_vm_exit()
> >          -> load_vmcs12_host_state()
> >                -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3
> 
> This is just a byproduct of using a unified function, i.e.,
> nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.
> 
> LAM spec says:
> 
> VM entry checks the values of the CR3 and CR4 fields in the guest-area
> and host-state area of the VMCS. In particular, the bits in these fields
> that correspond to bits reserved in the corresponding register are
> checked and must be 0.
> 
> It doesn't mention any check on VM exit. So, it looks to me that VM exit
> doesn't do consistency checks. Then, I think there is no need to call
> out this path.

But this isn't a true VMEXIT -- it is indeed a VMENTER from L0 to L1 using
VMCS01 but with an environment that allows L1 to run its VMEXIT handler just
like it received a VMEXIT from L2.

However I fully agree those code paths are details and shouldn't be changelog
material.

How about the below changelog?

Add support to allow the guest to set two new CR3 non-address control bits, so
that the guest can enable the new Intel CPU feature Linear Address Masking (LAM).

LAM modifies the checking that is applied to 64-bit linear addresses, allowing
software to use the untranslated address bits for metadata.  For user mode
linear addresses, LAM uses two new CR3 non-address bits LAM_U48 (bit 62) and
LAM_U57 (bit 61) to configure the metadata bits for 4-level paging and 5-level
paging respectively.  LAM also changes VMENTER to allow both bits to be set in
VMCS's HOST_CR3 and GUEST_CR3 to support virtualization.

When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of
the two LAM control bits.  However when EPT is off, the actual CR3 used by the
guest is generated from the shadow MMU root which is different from the CR3 that
is *set* by the guest, and KVM needs to manually apply any active control bits
to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest.

KVM manually checks guest's CR3 to make sure it points to a valid guest physical
address (i.e. to support smaller MAXPHYSADDR in the guest).  Extend this check
to allow the two LAM control bits to be set.  And to make such check generic,
introduce a new field 'cr3_ctrl_bits' to vcpu to record all feature control bits
that are allowed to be set by the guest.

In the nested case, for a guest which supports LAM, both VMCS12's HOST_CR3 and
GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters
L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly.  KVM also
manually checks that VMCS12's HOST_CR3 and GUEST_CR3 are valid physical
addresses.  Extend those checks to allow the new LAM control bits too.

Note, LAM doesn't have a global enable bit in any control register to turn
on/off LAM completely, but purely depends on the hardware's CPUID to determine
whether to perform the LAM check or not.  That means, when EPT is on, even when
KVM doesn't expose LAM to the guest, the guest can still set LAM control bits in
CR3 w/o causing a problem.  This is an unfortunate virtualization hole.  KVM
could choose to intercept CR3 in this case and inject a fault, but this would
hurt performance when running a normal VM w/o LAM support.  This is undesirable.
Just choose to let the guest do such an illegal thing, as the worst case is the
guest being killed when KVM eventually finds out about such illegal behaviour,
and that is the guest's own fault.
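
(Concretely, the EPT-off path described above builds the value written to the
VMCS's GUEST_CR3 roughly as in the vmx_load_mmu_pgd() hunk at the end of this
mail:)

	} else {
		/* Shadow paging: shadow root + guest PCID + guest's active LAM bits. */
		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
			    kvm_get_active_cr3_ctrl_bits(vcpu);
	}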
Huang, Kai April 27, 2023, 1:23 p.m. UTC | #26
On Wed, 2023-04-26 at 18:52 +0800, Binbin Wu wrote:
> 
> On 4/26/2023 4:43 PM, Huang, Kai wrote:
> > On Wed, 2023-04-26 at 11:05 +0800, Gao, Chao wrote:
> > > On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
> > > > ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
> > > > KVM.
> > > > 
> > > > Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
> > > > seems there isn't because AFAICT the bits in CR4 are used to control super mode
> > > > linear address but not LAM in global?
> > > Right.
> > > 
> > > > So if it is true, then it appears hardware depends on CPUID purely to decide
> > > > whether to perform LAM or not.
> > > > 
> > > > Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
> > > > hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
> > > > causing any trouble (because the hardware actually supports this feature)?
> > > Yes. But I think it is a non-issue ...
> > > 
> > > > If it's true, it seems we should trap CR3 (at least loading) when hardware
> > > > supports LAM but it's not exposed to the guest, so that KVM can correctly reject
> > > > any LAM control bits when guest illegally does so?
> > > > 
> > > Other features which need no explicit enablement (like AVX and other
> > > new instructions) have the same problem.
> > OK.
> > 
> > > The impact is some guests can use features which they are not supposed
> > > to use. Then they might be broken after migration or kvm's instruction
> > > emulation. But they put themselves at stake, KVM shouldn't be blamed.
> > > 
> > > The downside of intercepting CR3 is the performance impact on existing
> > > VMs (all with old CPU models and thus all have no LAM). If they are
> > > migrated to LAM-capable parts in the future, they will suffer
> > > performance drop even though they are good tenents (i.e., won't try to
> > > use LAM).
> > > 
> > > IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> > > when EPT=on cannot outweigh the performance impact. So, I vote to
> > > document in changelog or comments that:
> > > A guest can enable LAM for userspace pointers when EPT=on even if LAM
> > > isn't exposed to it. KVM doens't prevent this out of performance
> > > consideration
> > Yeah performance impact is the concern.  I agree we can just call out this in
> > changelog and/or comments.  Just want to make sure this is mentioned/discussed.
> > 
> > My main concern is, as (any) VMEXIT saves guest's CR3 to VMCS's GUEST_CR3, KVM
> > may see GUEST_CR3 containing invalid control bits (because KVM believes the
> > guest doesn't support those feature bits), and if KVM code carelessly uses
> > WARN() around those code, a malicious guest may be able to attack host, which
> > means we need to pay more attention to code review around GUEST_CR3 in the
> > future.
> 
> How about do guest CR3 LAM bits check based on the capability of 
> physical processor instead of vCPU?
> That is KVM doesn't prevent guest from setting CR3 LAM bits if the 
> processor supports LAM.
> > 
> 

Doesn't seem to have any benefit, as this looks like another hack (which also
breaks the rules of virtualization) in order to (try to) fix a virtualization
hole.
Binbin Wu April 29, 2023, 4:56 a.m. UTC | #27
On 4/27/2023 9:19 PM, Huang, Kai wrote:
> On Sat, 2023-04-22 at 12:43 +0800, Gao, Chao wrote:
>> On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
>>> Kai,
>>>
>>> Thanks for your inputs.
>>>
>>> I rephrased the changelog as following:
>>>
>>>
>>> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
>>> masking for user mode pointers.
>>>
>>> To support LAM in KVM, CR3 validity checks and shadow paging handling need to
>>> be
>>> modified accordingly.
>>>
>>> == CR3 validity Check ==
>>> When LAM is supported, CR3 LAM bits are allowed to be set and the check of
>>> CR3
>>> needs to be modified.
>> it is better to describe the hardware change here:
>>
>> On processors that enumerate support for LAM, CR3 register allows
>> LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both
>> GUEST_CR3 and HOST_CR3 fields.
>>
>> To emulate LAM hardware behavior, KVM needs to
>> 1. allow LAM_U48/U57 to be set to the CR3 register by guest or userspace
>> 2. allow LAM_U48/U57 to be set to the GUES_CR3/HOST_CR3 fields in vmcs12
> Agreed.  This is more clearer.
>
>>> Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
>>> kvm_vcpu_is_legal_gpa()
>>> to do the new CR3 checks in all existing CR3 checks as following:
>>> When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
>>> Non-nested case
>>> - When EPT on, CR3 is fully under control of guest.
>>> - When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
>>>    CR3 VMExit handling.
>>> Nested case, from L0's perspective, we care about:
>>> - L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
>>> - L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
>>>    Two paths related:
>>>    1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
>>>           nested_vm_exit()
>>>           -> load_vmcs12_host_state()
>>>                 -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3
>> This is just a byproduct of using a unified function, i.e.,
>> nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.
>>
>> LAM spec says:
>>
>> VM entry checks the values of the CR3 and CR4 fields in the guest-area
>> and host-state area of the VMCS. In particular, the bits in these fields
>> that correspond to bits reserved in the corresponding register are
>> checked and must be 0.
>>
>> It doesn't mention any check on VM exit. So, it looks to me that VM exit
>> doesn't do consistency checks. Then, I think there is no need to call
>> out this path.
> But this isn't a true VMEXIT -- it is indeed a VMENTER from L0 to L1 using
> VMCS01 but with an environment that allows L1 to run its VMEXIT handler just
> like it received a VMEXIT from L2.
>
> However I fully agree those code paths are details and shouldn't be changelog
> material.
>
> How about below changelog?
>
> Add support to allow guest to set two new CR3 non-address control bits to allow
> guest to enable the new Intel CPU feature Linear Address Masking (LAM).
>
> LAM modifies the checking that is applied to 64-bit linear addresses, allowing
> software to use of the untranslated address bits for metadata.  For user mode
> linear address, LAM uses two new CR3 non-address bits LAM_U48 (bit 62) and
> LAM_U57 (bit 61) to configure the metadata bits for 4-level paging and 5-level
> paging respectively.  LAM also changes VMENTER to allow both bits to be set in
> VMCS's HOST_CR3 and GUEST_CR3 to support virtualization.
>
> When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of
> the two LAM control bits.  However when EPT is off, the actual CR3 used by the
> guest is generated from the shadow MMU root which is different from the CR3 that
> is *set* by the guest, and KVM needs to manually apply any active control bits
> to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest.
>
> KVM manually checks guest's CR3 to make sure it points to a valid guest physical
> address (i.e. to support smaller MAXPHYSADDR in the guest).  Extend this check
> to allow the two LAM control bits to be set.  And to make such check generic,
> introduce a new field 'cr3_ctrl_bits' to vcpu to record all feature control bits
> that are allowed to be set by the guest.
>
> In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and
> GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters
> L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly.  KVM also
> manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address.
> Extend such check to allow the new LAM control bits too.
>
> Note, LAM doesn't have a global enable bit in any control register to turn
> on/off LAM completely, but purely depends on hardware's CPUID to determine
> whether to perform LAM check or not.  That means, when EPT is on, even when KVM
> doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o
> causing problem.  This is an unfortunate virtualization hole.  KVM could choose
> to intercept CR3 in this case and inject fault but this would hurt performance
> when running a normal VM w/o LAM support.  This is undesirable.  Just choose to
> let the guest do such illegal thing as the worst case is guest being killed when
> KVM eventually find out such illegal behaviour and that is the guest to blame.
Thanks for the advice.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ba594f9ea414..498d2b5e8dc1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -729,6 +729,12 @@  struct kvm_vcpu_arch {
 	unsigned long cr0_guest_owned_bits;
 	unsigned long cr2;
 	unsigned long cr3;
+	/*
+	 * Bits in CR3 used to enable certain features. These bits are allowed
+	 * to be set in CR3 when vCPU supports the features. When shadow paging
+	 * is used, these bits should be kept as they are in the shadow CR3.
+	 */
+	u64 cr3_ctrl_bits;
 	unsigned long cr4;
 	unsigned long cr4_guest_owned_bits;
 	unsigned long cr4_guest_rsvd_bits;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index b1658c0de847..ef8e1b912d7d 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -42,6 +42,11 @@  static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
 	return vcpu->arch.maxphyaddr;
 }
 
+static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
+}
+
 static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
 	return !(gpa & vcpu->arch.reserved_gpa_bits);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 168c46fd8dd1..29985eeb8e12 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -142,6 +142,11 @@  static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
 	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
 }
 
+static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
+}
+
 static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 {
 	u64 root_hpa = vcpu->arch.mmu->root.hpa;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..de2c51a0b611 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3732,7 +3732,11 @@  static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	hpa_t root;
 
 	root_pgd = mmu->get_guest_pgd(vcpu);
-	root_gfn = root_pgd >> PAGE_SHIFT;
+	/*
+	 * The guest PGD has already been checked for validity; unconditionally
+	 * strip non-address bits when computing the GFN.
+	 */
+	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
 
 	if (mmu_check_root(vcpu, root_gfn))
 		return 1;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index cc58631e2336..c0479cbc2ca3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -21,6 +21,7 @@  extern bool dbg;
 #endif
 
 /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
+#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
 	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
 #define __PT_INDEX(address, level, bits_per_level) \
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 57f0b75c80f9..88351ba04249 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -62,7 +62,7 @@ 
 #endif
 
 /* Common logic, but per-type values.  These also need to be undefined. */
-#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
+#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
 #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
 #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
 #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
@@ -324,6 +324,10 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
+	/*
+	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
+	 * non-address bits.
+	 */
 	pte           = mmu->get_guest_pgd(vcpu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..777f7d443e3b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -36,7 +36,7 @@  static_assert(SPTE_TDP_AD_ENABLED == 0);
 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
 #else
-#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
 #endif
 
 #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1bc2b80273c9..d35bda9610e2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1079,7 +1079,7 @@  static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 			       bool nested_ept, bool reload_pdptrs,
 			       enum vm_entry_failure_code *entry_failure_code)
 {
-	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
+	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
 		*entry_failure_code = ENTRY_FAIL_DEFAULT;
 		return -EINVAL;
 	}
@@ -2907,7 +2907,7 @@  static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 
 	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
 	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
-	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
+	    CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3)))
 		return -EINVAL;
 
 	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 42f163862a0f..4d329ee9474c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3388,7 +3388,8 @@  static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			update_guest_cr3 = false;
 		vmx_ept_load_pdptrs(vcpu);
 	} else {
-		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
+		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
+		            kvm_get_active_cr3_ctrl_bits(vcpu);
 	}
 
 	if (update_guest_cr3)
@@ -7763,6 +7764,9 @@  static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		vmx->msr_ia32_feature_control_valid_bits &=
 			~FEAT_CTL_SGX_LC_ENABLED;
 
+	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
+		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
+
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7713420abab0..aca255e69d0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1260,7 +1260,7 @@  int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
 	 * the current vCPU mode is accurate.
 	 */
-	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
+	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
 		return 1;
 
 	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
@@ -11349,7 +11349,7 @@  static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		 */
 		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
 			return false;
-		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
+		if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
 			return false;
 	} else {
 		/*