diff mbox

[v3,2/5] KVM: MMU: check guest CR3 reserved bits based on its physical address width.

Message ID 1503577676-12345-3-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yu Zhang Aug. 24, 2017, 12:27 p.m. UTC
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the length of reserved bits in
guest CR3 should be based on the physical address width
exposed to the VM. This patch changes CR3 check logic to
calculate the reserved bits at runtime.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/emulate.c          | 14 ++++++++++++--
 arch/x86/kvm/mmu.h              |  3 +++
 arch/x86/kvm/x86.c              |  8 ++++----
 4 files changed, 19 insertions(+), 7 deletions(-)

Comments

Paolo Bonzini Aug. 24, 2017, 1:40 p.m. UTC | #1
On 24/08/2017 14:27, Yu Zhang wrote:
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 3ed6192..67e7ec2 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -48,6 +48,9 @@
>  
>  static inline u64 rsvd_bits(int s, int e)
>  {
> +	if (e < s)
> +		return 0;
> +
>  	return ((1ULL << (e - s + 1)) - 1) << s;
>  }

e = s - 1 is already supported; why do you need e <= s - 2?

Paolo
Yu Zhang Aug. 24, 2017, 3:23 p.m. UTC | #2
On 8/24/2017 9:40 PM, Paolo Bonzini wrote:
> On 24/08/2017 14:27, Yu Zhang wrote:
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 3ed6192..67e7ec2 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -48,6 +48,9 @@
>>   
>>   static inline u64 rsvd_bits(int s, int e)
>>   {
>> +	if (e < s)
>> +		return 0;
>> +
>>   	return ((1ULL << (e - s + 1)) - 1) << s;
>>   }
> e = s - 1 is already supported; why do you need e <= s - 2?

Sorry? I do not quite understand. When will e = s - 1?

Thanks
Yu
> Paolo
>
Yu Zhang Aug. 24, 2017, 3:38 p.m. UTC | #3
On 8/24/2017 11:50 PM, Paolo Bonzini wrote:
> On 24/08/2017 17:23, Yu Zhang wrote:
>>>>      static inline u64 rsvd_bits(int s, int e)
>>>>    {
>>>> +    if (e < s)
>>>> +        return 0;
>>>> +
>>>>        return ((1ULL << (e - s + 1)) - 1) << s;
>>>>    }
>>> e = s - 1 is already supported; why do you need e <= s - 2?
>> Sorry? I do not quite understand. When will e = s - 1?
> Is there any case where e < s?  I can see that MAXPHYADDR=63 gives
> rsvd_bits(63, 62), but that works.
>
> In practice, MAXPHYADDR will never be 59 even because the PKRU bits are
> at bits 59..62.

Thanks, Paolo.
I see. I had made an assumption that MAXPHYADDR shall not exceed the 
physical one,
which is 52 I believe. But I'm not sure there's any place to check this.
Maybe we should make sure the vcpu->arch.maxphyaddr will not be greater 
than the
value of the host?

Yu
>
> Paolo
>
Paolo Bonzini Aug. 24, 2017, 3:50 p.m. UTC | #4
On 24/08/2017 17:23, Yu Zhang wrote:
>>>
>>>     static inline u64 rsvd_bits(int s, int e)
>>>   {
>>> +    if (e < s)
>>> +        return 0;
>>> +
>>>       return ((1ULL << (e - s + 1)) - 1) << s;
>>>   }
>> e = s - 1 is already supported; why do you need e <= s - 2?
> 
> Sorry? I do not quite understand. When will e = s - 1?

Is there any case where e < s?  I can see that MAXPHYADDR=63 gives
rsvd_bits(63, 62), but that works.

In practice, MAXPHYADDR will never be 59 even because the PKRU bits are
at bits 59..62.

Paolo
Yu Zhang Aug. 24, 2017, 4:21 p.m. UTC | #5
On 8/25/2017 12:27 AM, Paolo Bonzini wrote:
> On 24/08/2017 17:38, Yu Zhang wrote:
>>>
>>> In practice, MAXPHYADDR will never be 59 even because the PKRU bits are
>>> at bits 59..62.
>> Thanks, Paolo.
>> I see. I had made an assumption that MAXPHYADDR shall not exceed the
>> physical one,
>> which is 52 I believe. But I'm not sure there's any place to check this.
>> Maybe we should make sure the vcpu->arch.maxphyaddr will not be greater
>> than the value of the host?
> That's a separate change anyway.  In any case, since currently the
> MAXPHYADDR is not validated, your change to rsvd_bits makes sense.

Thanks, Paolo.
As to this patch series, any change I need do?

BTW,  I have written a patch for kvm-unit-test access test, but the test 
failed.
Not sure if my patch is erroneous or due to a simulator error. I'll send 
out the
test patch after it works.:-)

Yu
> Paolo
>
Paolo Bonzini Aug. 24, 2017, 4:27 p.m. UTC | #6
On 24/08/2017 17:38, Yu Zhang wrote:
>>
>>
>> In practice, MAXPHYADDR will never be 59 even because the PKRU bits are
>> at bits 59..62.
> 
> Thanks, Paolo.
> I see. I had made an assumption that MAXPHYADDR shall not exceed the
> physical one,
> which is 52 I believe. But I'm not sure there's any place to check this.
> Maybe we should make sure the vcpu->arch.maxphyaddr will not be greater
> than the value of the host?

That's a separate change anyway.  In any case, since currently the
MAXPHYADDR is not validated, your change to rsvd_bits makes sense.

Paolo
Paolo Bonzini Aug. 24, 2017, 4:51 p.m. UTC | #7
On 24/08/2017 18:21, Yu Zhang wrote:
> 
> 
> On 8/25/2017 12:27 AM, Paolo Bonzini wrote:
>> On 24/08/2017 17:38, Yu Zhang wrote:
>>>>
>>>> In practice, MAXPHYADDR will never be 59 even because the PKRU bits are
>>>> at bits 59..62.
>>> Thanks, Paolo.
>>> I see. I had made an assumption that MAXPHYADDR shall not exceed the
>>> physical one,
>>> which is 52 I believe. But I'm not sure there's any place to check this.
>>> Maybe we should make sure the vcpu->arch.maxphyaddr will not be greater
>>> than the value of the host?
>> That's a separate change anyway.  In any case, since currently the
>> MAXPHYADDR is not validated, your change to rsvd_bits makes sense.
> 
> Thanks, Paolo.
> As to this patch series, any change I need do?

No, it's fine.

> BTW,  I have written a patch for kvm-unit-test access test, but the test
> failed.
> Not sure if my patch is erroneous or due to a simulator error. I'll send
> out the
> test patch after it works.:-)

Try to send it.  I can also test it with QEMU TCG.

Paolo
Jim Mattson Sept. 15, 2017, 11:19 p.m. UTC | #8
On Thu, Aug 24, 2017 at 5:27 AM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
> reserved bits in CR3. Yet the length of reserved bits in
> guest CR3 should be based on the physical address width
> exposed to the VM. This patch changes CR3 check logic to
> calculate the reserved bits at runtime.
>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/emulate.c          | 14 ++++++++++++--
>  arch/x86/kvm/mmu.h              |  3 +++
>  arch/x86/kvm/x86.c              |  8 ++++----
>  4 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6db0ed9..e716228 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -79,7 +79,6 @@
>                           | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
>                           | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
>
> -#define CR3_L_MODE_RESERVED_BITS 0xFFFFFF0000000000ULL
>  #define CR3_PCID_INVD           BIT_64(63)
>  #define CR4_RESERVED_BITS                                               \
>         (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 319d91f..a89b595 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -28,6 +28,7 @@
>
>  #include "x86.h"
>  #include "tss.h"
> +#include "mmu.h"
>
>  /*
>   * Operand types
> @@ -4097,8 +4098,17 @@ static int check_cr_write(struct x86_emulate_ctxt *ctxt)
>                 u64 rsvd = 0;
>
>                 ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
> -               if (efer & EFER_LMA)
> -                       rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
> +               if (efer & EFER_LMA) {
> +                       u64 maxphyaddr;
> +                       u32 eax = 0x80000008;
> +
> +                       if (ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL,
> +                                                NULL, false))

Passing NULL for the address of ecx looks problematic to me.

We have:

static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt,
                        u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool
check_limit)
{
        return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx, check_limit);
}

And:

bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
       u32 *ecx, u32 *edx, bool check_limit)
{
u32 function = *eax, index = *ecx;
struct kvm_cpuid_entry2 *best;
bool entry_found = true;
...

Doesn't this immediately try to dereference a NULL pointer?  How much
testing have you done of this code?

> +                               maxphyaddr = eax & 0xff;
> +                       else
> +                               maxphyaddr = 36;
> +                       rsvd = rsvd_bits(maxphyaddr, 62);
> +               }
>
>                 if (new_val & rsvd)
>                         return emulate_gp(ctxt, 0);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 3ed6192..67e7ec2 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -48,6 +48,9 @@
>
>  static inline u64 rsvd_bits(int s, int e)
>  {
> +       if (e < s)
> +               return 0;
> +
>         return ((1ULL << (e - s + 1)) - 1) << s;
>  }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cc2c7e4..79f5889 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -813,10 +813,10 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>                 return 0;
>         }
>
> -       if (is_long_mode(vcpu)) {
> -               if (cr3 & CR3_L_MODE_RESERVED_BITS)
> -                       return 1;
> -       } else if (is_pae(vcpu) && is_paging(vcpu) &&
> +       if (is_long_mode(vcpu) &&
> +           (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 62)))
> +               return 1;
> +       else if (is_pae(vcpu) && is_paging(vcpu) &&
>                    !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
>                 return 1;
>
> --
> 2.5.0
>
Yu Zhang Sept. 18, 2017, 8:15 a.m. UTC | #9
On 9/16/2017 7:19 AM, Jim Mattson wrote:
> On Thu, Aug 24, 2017 at 5:27 AM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>> Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
>> reserved bits in CR3. Yet the length of reserved bits in
>> guest CR3 should be based on the physical address width
>> exposed to the VM. This patch changes CR3 check logic to
>> calculate the reserved bits at runtime.
>>
>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |  1 -
>>   arch/x86/kvm/emulate.c          | 14 ++++++++++++--
>>   arch/x86/kvm/mmu.h              |  3 +++
>>   arch/x86/kvm/x86.c              |  8 ++++----
>>   4 files changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 6db0ed9..e716228 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -79,7 +79,6 @@
>>                            | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
>>                            | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
>>
>> -#define CR3_L_MODE_RESERVED_BITS 0xFFFFFF0000000000ULL
>>   #define CR3_PCID_INVD           BIT_64(63)
>>   #define CR4_RESERVED_BITS                                               \
>>          (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index 319d91f..a89b595 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -28,6 +28,7 @@
>>
>>   #include "x86.h"
>>   #include "tss.h"
>> +#include "mmu.h"
>>
>>   /*
>>    * Operand types
>> @@ -4097,8 +4098,17 @@ static int check_cr_write(struct x86_emulate_ctxt *ctxt)
>>                  u64 rsvd = 0;
>>
>>                  ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
>> -               if (efer & EFER_LMA)
>> -                       rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
>> +               if (efer & EFER_LMA) {
>> +                       u64 maxphyaddr;
>> +                       u32 eax = 0x80000008;
>> +
>> +                       if (ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL,
>> +                                                NULL, false))
> Passing NULL for the address of ecx looks problematic to me.
>
> We have:
>
> static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt,
>                          u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool
> check_limit)
> {
>          return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx, check_limit);
> }
>
> And:
>
> bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
>         u32 *ecx, u32 *edx, bool check_limit)
> {
> u32 function = *eax, index = *ecx;
> struct kvm_cpuid_entry2 *best;
> bool entry_found = true;
> ...
>
> Doesn't this immediately try to dereference a NULL pointer?  How much
> testing have you done of this code?

Thanks Jim.
I have tested this code in a simulator to successfully boot a VM in 
shadow mode.
Seems this code is not covered(but I am now still perplexed why this is 
not covered).
Any possibility that the check_cr_write() is not triggered when 
emulating the cr
operations?

Anyway, this should be a bug and thanks for pointing this out, and I'll 
send out the
fix later.

BR
Yu
Paolo Bonzini Sept. 18, 2017, 8:41 a.m. UTC | #10
On 18/09/2017 10:15, Yu Zhang wrote:
>>
>> static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt,
>>                          u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool
>> check_limit)
>> {
>>          return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx,
>> check_limit);
>> }
>>
>> And:
>>
>> bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
>>         u32 *ecx, u32 *edx, bool check_limit)
>> {
>> u32 function = *eax, index = *ecx;
>> struct kvm_cpuid_entry2 *best;
>> bool entry_found = true;
>> ...
>>
>> Doesn't this immediately try to dereference a NULL pointer?  How much
>> testing have you done of this code?
> 
> Thanks Jim.
> I have tested this code in a simulator to successfully boot a VM in 
> shadow mode. Seems this code is not covered(but I am now still
> perplexed why this is not covered). Any possibility that the
> check_cr_write() is not triggered when emulating the cr operations?

CR moves usually don't go through the emulator (the main exception is
emulation of invalid guest state when the processor doesn't support
unrestricted_guest=1, but even that is unlikely to happen with
EFER.LMA=1).  This explains why you didn't see the failure.

> Anyway, this should be a bug and thanks for pointing this out, and
> I'll send out the fix later.
Thanks,

Paolo
Yu Zhang Sept. 18, 2017, 9:35 a.m. UTC | #11
On 9/18/2017 4:41 PM, Paolo Bonzini wrote:
> On 18/09/2017 10:15, Yu Zhang wrote:
>>> static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt,
>>>                           u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool
>>> check_limit)
>>> {
>>>           return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx,
>>> check_limit);
>>> }
>>>
>>> And:
>>>
>>> bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
>>>          u32 *ecx, u32 *edx, bool check_limit)
>>> {
>>> u32 function = *eax, index = *ecx;
>>> struct kvm_cpuid_entry2 *best;
>>> bool entry_found = true;
>>> ...
>>>
>>> Doesn't this immediately try to dereference a NULL pointer?  How much
>>> testing have you done of this code?
>> Thanks Jim.
>> I have tested this code in a simulator to successfully boot a VM in
>> shadow mode. Seems this code is not covered(but I am now still
>> perplexed why this is not covered). Any possibility that the
>> check_cr_write() is not triggered when emulating the cr operations?
> CR moves usually don't go through the emulator (the main exception is
> emulation of invalid guest state when the processor doesn't support
> unrestricted_guest=1, but even that is unlikely to happen with
> EFER.LMA=1).  This explains why you didn't see the failure.

Oh, right. It normally goes to handle_cr(). Thanks, Paolo.

Yu

>
>> Anyway, this should be a bug and thanks for pointing this out, and
>> I'll send out the fix later.
> Thanks,
>
> Paolo
>
diff mbox

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6db0ed9..e716228 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -79,7 +79,6 @@ 
 			  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
 			  | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
 
-#define CR3_L_MODE_RESERVED_BITS 0xFFFFFF0000000000ULL
 #define CR3_PCID_INVD		 BIT_64(63)
 #define CR4_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 319d91f..a89b595 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -28,6 +28,7 @@ 
 
 #include "x86.h"
 #include "tss.h"
+#include "mmu.h"
 
 /*
  * Operand types
@@ -4097,8 +4098,17 @@  static int check_cr_write(struct x86_emulate_ctxt *ctxt)
 		u64 rsvd = 0;
 
 		ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
-		if (efer & EFER_LMA)
-			rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
+		if (efer & EFER_LMA) {
+			u64 maxphyaddr;
+			u32 eax = 0x80000008;
+
+			if (ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL,
+						 NULL, false))
+				maxphyaddr = eax & 0xff;
+			else
+				maxphyaddr = 36;
+			rsvd = rsvd_bits(maxphyaddr, 62);
+		}
 
 		if (new_val & rsvd)
 			return emulate_gp(ctxt, 0);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3ed6192..67e7ec2 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -48,6 +48,9 @@ 
 
 static inline u64 rsvd_bits(int s, int e)
 {
+	if (e < s)
+		return 0;
+
 	return ((1ULL << (e - s + 1)) - 1) << s;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c7e4..79f5889 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -813,10 +813,10 @@  int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		return 0;
 	}
 
-	if (is_long_mode(vcpu)) {
-		if (cr3 & CR3_L_MODE_RESERVED_BITS)
-			return 1;
-	} else if (is_pae(vcpu) && is_paging(vcpu) &&
+	if (is_long_mode(vcpu) &&
+	    (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 62)))
+		return 1;
+	else if (is_pae(vcpu) && is_paging(vcpu) &&
 		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
 		return 1;