KVM: nVMX: Rework event injection and recovery

Message ID 5124C93B.50902@siemens.com (mailing list archive)
State New, archived

Commit Message

Jan Kiszka Feb. 20, 2013, 1:01 p.m. UTC
This aligns VMX more with SVM regarding event injection and recovery for
nested guests. The changes allow injecting interrupts directly from L0
to L2.

One difference to SVM is that we always transfer the pending event
injection into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1.

VMX and SVM are now identical in how they recover event injections from
unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
still contains a valid event and, if yes, transfer the content into L1's
idt_vectoring_info_field.

To avoid incorrectly leaking into the architectural VCPU state an event
that L1 wants to inject, we skip cancellation on nested run.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Survived moderate testing here and (currently) makes sense to me, but
please review very carefully. I wouldn't be surprised if I'm still
missing some subtle corner case.

 arch/x86/kvm/vmx.c |   57 +++++++++++++++++++++++----------------------------
 1 files changed, 26 insertions(+), 31 deletions(-)
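
The core of the recovery logic, condensed from the prepare_vmcs12() hunk
quoted in full below (a simplified sketch, not the literal diff):

	/* drop whatever vmx_complete_interrupts() transferred for L0 */
	vcpu->arch.nmi_injected = false;
	kvm_clear_exception_queue(vcpu);
	kvm_clear_interrupt_queue(vcpu);

	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
		/*
		 * VM_ENTRY_INTR_INFO_FIELD still holding a valid event means
		 * the vmlaunch/vmresume was never performed: emulate the
		 * event being reported back via IDT_VECTORING_INFO_FIELD.
		 */
		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
		    INTR_INFO_VALID_MASK) {
			vmcs12->idt_vectoring_info_field =
				vmcs12->vm_entry_intr_info_field;
			...
			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
		}
		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
	}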

Comments

Nadav Har'El Feb. 20, 2013, 2:14 p.m. UTC | #1
Hi,

By the way, if you haven't seen my description of why the current code
did what it did, take a look at
http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
Another description might also come in handy:
http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html

On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow injecting interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.

Last time I checked, if I'm remembering correctly, the nested SVM code did
something a bit different: After the exit from L2 to L1 and unnecessarily
queuing the pending interrupt for injection, it skipped one entry into L1,
and as usual after the entry the interrupt queue is cleared so next time
around, when L1 is really entered, the wrong injection is not attempted.

> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.

> To avoid incorrectly leaking into the architectural VCPU state an event
> that L1 wants to inject, we skip cancellation on nested run.

I didn't understand this last point.

> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>  	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> -	/* clear vm-entry fields which are to be cleared on exit */
> -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> +	/* drop what we picked up for L0 via vmx_complete_interrupts */
> +	vcpu->arch.nmi_injected = false;
> +	kvm_clear_exception_queue(vcpu);
> +	kvm_clear_interrupt_queue(vcpu);

It would be nice to move these lines out of prepare_vmcs12(), since they
don't really do anything with vmcs12, and move them into
nested_vmx_vmexit() (which is the one that called prepare_vmcs12()).

Did you test this both with PIN_BASED_EXT_INTR_MASK (the usual case) and
!PIN_BASED_EXT_INTR_MASK (the case which interests you)? We need to make
sure that in the former case, this doesn't clear the interrupt queue after
we put an interrupt to be injected in it (at first glance it seems fine,
but these code paths are so convoluted, it's hard to be sure).
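
For context: the distinction hinges on whether L1 asked for external
interrupts to cause an exit from L2. In the nVMX code of this era that is
decided by a helper along these lines (a minimal sketch; the exact form in
vmx.c may differ):

	static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
	{
		/*
		 * With PIN_BASED_EXT_INTR_MASK set by L1, an external
		 * interrupt arriving while L2 runs forces a vmexit to L1;
		 * without it, L0 may inject the interrupt directly into L2.
		 */
		return get_vmcs12(vcpu)->pin_based_vm_exec_control &
			PIN_BASED_EXT_INTR_MASK;
	}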

> +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> +	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> +		/*
> +		 * Preserve the event that was supposed to be injected
> +		 * by emulating that, had the entry occurred, it would
> +		 * have been reported in IDT_VECTORING_INFO_FIELD.
> +		 */
> +		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> +		    INTR_INFO_VALID_MASK) {
> +			vmcs12->idt_vectoring_info_field =
> +				vmcs12->vm_entry_intr_info_field;
> +			vmcs12->idt_vectoring_error_code =
> +				vmcs12->vm_entry_exception_error_code;
> +			vmcs12->vm_exit_instruction_len =
> +				vmcs12->vm_entry_instruction_len;
> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);

I'm afraid I'm missing what you are trying to do here. Why would
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK ever be
true? After all, the processor clears it after each successful exit, so
the last if() will only succeed on failed entries - but that is NOT the
case if we're in the enclosing if (note that vmcs12->vm_exit_reason =
vmcs_read32(VM_EXIT_REASON)). Maybe I'm missing something?

Nadav.
Jan Kiszka Feb. 20, 2013, 2:37 p.m. UTC | #2
On 2013-02-20 15:14, Nadav Har'El wrote:
> Hi,
> 
> By the way, if you haven't seen my description of why the current code
> did what it did, take a look at
> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
> Another description might also come in handy:
> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
> 
> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
>> This aligns VMX more with SVM regarding event injection and recovery for
>> nested guests. The changes allow injecting interrupts directly from L0
>> to L2.
>>
>> One difference to SVM is that we always transfer the pending event
>> injection into the architectural state of the VCPU and then drop it from
>> there if it turns out that we left L2 to enter L1.
> 
> Last time I checked, if I'm remembering correctly, the nested SVM code did
> something a bit different: After the exit from L2 to L1 and unnecessarily
> queuing the pending interrupt for injection, it skipped one entry into L1,
> and as usual after the entry the interrupt queue is cleared so next time
> around, when L1 is really entered, the wrong injection is not attempted.
> 
>> VMX and SVM are now identical in how they recover event injections from
>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>> still contains a valid event and, if yes, transfer the content into L1's
>> idt_vectoring_info_field.
> 
>> To avoid incorrectly leaking into the architectural VCPU state an event
>> that L1 wants to inject, we skip cancellation on nested run.
> 
> I didn't understand this last point.

- prepare_vmcs02 sets event to be injected into L2
- while trying to enter L2, a cancel condition is met
- we call vmx_cancel_interrupts but should now avoid filling L1's event
  into the arch event queues - it's kept in vmcs12

> 
>> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>  	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>>  	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>>  
>> -	/* clear vm-entry fields which are to be cleared on exit */
>> -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
>> +	/* drop what we picked up for L0 via vmx_complete_interrupts */
>> +	vcpu->arch.nmi_injected = false;
>> +	kvm_clear_exception_queue(vcpu);
>> +	kvm_clear_interrupt_queue(vcpu);
> 
> It would be nice to move these lines out of prepare_vmcs12(), since they
> don't really do anything with vmcs12, and move them into
> nested_vmx_vmexit() (which is the one that called prepare_vmcs12()).

OK.

> 
> Did you test this both with PIN_BASED_EXT_INTR_MASK (the usual case) and
> !PIN_BASED_EXT_INTR_MASK (the case which interests you)? We need to make
> sure that in the former case, this doesn't clear the interrupt queue after
> we put an interrupt to be injected in it (at first glance it seems fine,
> but these code paths are so convoluted, it's hard to be sure).

I tested both, but none of my tests came close to covering all potential
corner cases. But that unconditional queue clearing surely deserves
attention and critical review.

> 
>> +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
>> +	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
>> +		/*
>> +		 * Preserve the event that was supposed to be injected
>> +		 * by emulating that, had the entry occurred, it would
>> +		 * have been reported in IDT_VECTORING_INFO_FIELD.
>> +		 */
>> +		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
>> +		    INTR_INFO_VALID_MASK) {
>> +			vmcs12->idt_vectoring_info_field =
>> +				vmcs12->vm_entry_intr_info_field;
>> +			vmcs12->idt_vectoring_error_code =
>> +				vmcs12->vm_entry_exception_error_code;
>> +			vmcs12->vm_exit_instruction_len =
>> +				vmcs12->vm_entry_instruction_len;
>> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> 
> I'm afraid I'm missing what you are trying to do here. Why would
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK ever be
> true? After all, the processor clears it after each successful exit, so
> the last if() will only succeed on failed entries - but that is NOT the
> case if we're in the enclosing if (note that vmcs12->vm_exit_reason =
> vmcs_read32(VM_EXIT_REASON)). Maybe I'm missing something?

A canceled vmentry, as indicated above. Look at vcpu_enter_guest:
kvm_mmu_reload may fail, or we may need to handle some async event or
perform a reschedule. But those points come after prepare_vmcs02.
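
Abridged, the path in question looks roughly like this (a sketch of
vcpu_enter_guest() from arch/x86/kvm/x86.c of that era, heavily simplified):

	static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
	{
		...
		/* on a nested run, prepare_vmcs02() has already filled
		   VM_ENTRY_INTR_INFO_FIELD with L1's event for L2 */
		inject_pending_event(vcpu);
		...
		r = kvm_mmu_reload(vcpu);
		if (unlikely(r))
			goto cancel_injection;	/* entry canceled */
		...
		if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests ||
		    need_resched() || signal_pending(current)) {
			r = 1;
			goto cancel_injection;	/* async event / reschedule */
		}
		...
	cancel_injection:
		kvm_x86_ops->cancel_injection(vcpu);	/* -> vmx_cancel_injection() */
		...
	}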

Jan
Jan Kiszka Feb. 20, 2013, 2:53 p.m. UTC | #3
On 2013-02-20 14:01, Jan Kiszka wrote:
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow injecting interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.
> 
> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.
> 
> To avoid incorrectly leaking into the architectural VCPU state an event
> that L1 wants to inject, we skip cancellation on nested run.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> 
> Survived moderate testing here and (currently) makes sense to me, but
> please review very carefully. I wouldn't be surprised if I'm still
> missing some subtle corner case.

Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
injection work. So this patch logically depends on it.

Jan

> 
>  arch/x86/kvm/vmx.c |   57 +++++++++++++++++++++++----------------------------
>  1 files changed, 26 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index dd3a8a0..7d2fbd2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
>  
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> -	if (is_guest_mode(&vmx->vcpu))
> -		return;
>  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>  				  VM_EXIT_INSTRUCTION_LEN,
>  				  IDT_VECTORING_ERROR_CODE);
> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> -	if (is_guest_mode(vcpu))
> +	if (to_vmx(vcpu)->nested.nested_run_pending)
>  		return;
>  	__vmx_complete_interrupts(to_vmx(vcpu),
>  				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> @@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	unsigned long debugctlmsr;
>  
> -	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -		if (vmcs12->idt_vectoring_info_field &
> -				VECTORING_INFO_VALID_MASK) {
> -			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> -				vmcs12->idt_vectoring_info_field);
> -			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> -				vmcs12->vm_exit_instruction_len);
> -			if (vmcs12->idt_vectoring_info_field &
> -					VECTORING_INFO_DELIVER_CODE_MASK)
> -				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> -					vmcs12->idt_vectoring_error_code);
> -		}
> -	}
> -
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>  	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
>  		vmx->entry_time = ktime_get();
> @@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  
>  	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>  
> -	if (is_guest_mode(vcpu)) {
> -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -		vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> -		if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> -			vmcs12->idt_vectoring_error_code =
> -				vmcs_read32(IDT_VECTORING_ERROR_CODE);
> -			vmcs12->vm_exit_instruction_len =
> -				vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> -		}
> -	}
> -
>  	vmx->loaded_vmcs->launched = 1;
>  
>  	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>  	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> -	/* clear vm-entry fields which are to be cleared on exit */
> -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> +	/* drop what we picked up for L0 via vmx_complete_interrupts */
> +	vcpu->arch.nmi_injected = false;
> +	kvm_clear_exception_queue(vcpu);
> +	kvm_clear_interrupt_queue(vcpu);
> +
> +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> +	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> +		/*
> +		 * Preserve the event that was supposed to be injected
> +		 * by emulating that, had the entry occurred, it would
> +		 * have been reported in IDT_VECTORING_INFO_FIELD.
> +		 */
> +		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> +		    INTR_INFO_VALID_MASK) {
> +			vmcs12->idt_vectoring_info_field =
> +				vmcs12->vm_entry_intr_info_field;
> +			vmcs12->idt_vectoring_error_code =
> +				vmcs12->vm_entry_exception_error_code;
> +			vmcs12->vm_exit_instruction_len =
> +				vmcs12->vm_entry_instruction_len;
> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> +		}
> +
> +		/* clear vm-entry fields which are to be cleared on exit */
>  		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
> +	}
>  }
>  
>  /*
>
Gleb Natapov Feb. 20, 2013, 3:30 p.m. UTC | #4
On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
> On 2013-02-20 14:01, Jan Kiszka wrote:
> > This aligns VMX more with SVM regarding event injection and recovery for
> > nested guests. The changes allow injecting interrupts directly from L0
> > to L2.
> > 
> > One difference to SVM is that we always transfer the pending event
> > injection into the architectural state of the VCPU and then drop it from
> > there if it turns out that we left L2 to enter L1.
> > 
> > VMX and SVM are now identical in how they recover event injections from
> > unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> > still contains a valid event and, if yes, transfer the content into L1's
> > idt_vectoring_info_field.
> > 
> > To avoid incorrectly leaking into the architectural VCPU state an event
> > that L1 wants to inject, we skip cancellation on nested run.
> > 
> > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> > ---
> > 
> > Survived moderate testing here and (currently) makes sense to me, but
> > please review very carefully. I wouldn't be surprised if I'm still
> > missing some subtle corner case.
> 
> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
> injection work. So this patch logically depends on it.
> 
But this patch has hunks from that patch.

> Jan
> 
> > 
> >  arch/x86/kvm/vmx.c |   57 +++++++++++++++++++++++----------------------------
> >  1 files changed, 26 insertions(+), 31 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index dd3a8a0..7d2fbd2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
> >  
> >  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >  {
> > -	if (is_guest_mode(&vmx->vcpu))
> > -		return;
> >  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> >  				  VM_EXIT_INSTRUCTION_LEN,
> >  				  IDT_VECTORING_ERROR_CODE);
> > @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >  
> >  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> >  {
> > -	if (is_guest_mode(vcpu))
> > +	if (to_vmx(vcpu)->nested.nested_run_pending)
> >  		return;
> >  	__vmx_complete_interrupts(to_vmx(vcpu),
> >  				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> > @@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	unsigned long debugctlmsr;
> >  
> > -	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> > -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -		if (vmcs12->idt_vectoring_info_field &
> > -				VECTORING_INFO_VALID_MASK) {
> > -			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> > -				vmcs12->idt_vectoring_info_field);
> > -			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> > -				vmcs12->vm_exit_instruction_len);
> > -			if (vmcs12->idt_vectoring_info_field &
> > -					VECTORING_INFO_DELIVER_CODE_MASK)
> > -				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> > -					vmcs12->idt_vectoring_error_code);
> > -		}
> > -	}
> > -
> >  	/* Record the guest's net vcpu time for enforced NMI injections. */
> >  	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
> >  		vmx->entry_time = ktime_get();
> > @@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  
> >  	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> >  
> > -	if (is_guest_mode(vcpu)) {
> > -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -		vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> > -		if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> > -			vmcs12->idt_vectoring_error_code =
> > -				vmcs_read32(IDT_VECTORING_ERROR_CODE);
> > -			vmcs12->vm_exit_instruction_len =
> > -				vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> > -		}
> > -	}
> > -
> >  	vmx->loaded_vmcs->launched = 1;
> >  
> >  	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> > @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> >  	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> >  	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> >  
> > -	/* clear vm-entry fields which are to be cleared on exit */
> > -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> > +	/* drop what we picked up for L0 via vmx_complete_interrupts */
> > +	vcpu->arch.nmi_injected = false;
> > +	kvm_clear_exception_queue(vcpu);
> > +	kvm_clear_interrupt_queue(vcpu);
> > +
> > +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> > +	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> > +		/*
> > +		 * Preserve the event that was supposed to be injected
> > +		 * by emulating it would have been returned in
> > +		 * IDT_VECTORING_INFO_FIELD.
> > +		 */
> > +		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> > +		    INTR_INFO_VALID_MASK) {
> > +			vmcs12->idt_vectoring_info_field =
> > +				vmcs12->vm_entry_intr_info_field;
> > +			vmcs12->idt_vectoring_error_code =
> > +				vmcs12->vm_entry_exception_error_code;
> > +			vmcs12->vm_exit_instruction_len =
> > +				vmcs12->vm_entry_instruction_len;
> > +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> > +		}
> > +
> > +		/* clear vm-entry fields which are to be cleared on exit */
> >  		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
> > +	}
> >  }
> >  
> >  /*
> > 
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux

--
			Gleb.
Jan Kiszka Feb. 20, 2013, 3:51 p.m. UTC | #5
On 2013-02-20 16:30, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 14:01, Jan Kiszka wrote:
>>> This aligns VMX more with SVM regarding event injection and recovery for
>>> nested guests. The changes allow injecting interrupts directly from L0
>>> to L2.
>>>
>>> One difference to SVM is that we always transfer the pending event
>>> injection into the architectural state of the VCPU and then drop it from
>>> there if it turns out that we left L2 to enter L1.
>>>
>>> VMX and SVM are now identical in how they recover event injections from
>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>> still contains a valid event and, if yes, transfer the content into L1's
>>> idt_vectoring_info_field.
>>>
>>> To avoid incorrectly leaking into the architectural VCPU state an event
>>> that L1 wants to inject, we skip cancellation on nested run.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>> ---
>>>
>>> Survived moderate testing here and (currently) makes sense to me, but
>>> please review very carefully. I wouldn't be surprised if I'm still
>>> missing some subtle corner case.
>>
>> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
>> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
>> injection work. So this patch logically depends on it.
>>
> But this patch has hunks from that patch.

Not mechanically.

If you prefer that I merge them together, let me know.

Jan
Gleb Natapov Feb. 20, 2013, 3:57 p.m. UTC | #6
On Wed, Feb 20, 2013 at 04:51:39PM +0100, Jan Kiszka wrote:
> On 2013-02-20 16:30, Gleb Natapov wrote:
> > On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
> >> On 2013-02-20 14:01, Jan Kiszka wrote:
> >>> This aligns VMX more with SVM regarding event injection and recovery for
> >>> nested guests. The changes allow injecting interrupts directly from L0
> >>> to L2.
> >>>
> >>> One difference to SVM is that we always transfer the pending event
> >>> injection into the architectural state of the VCPU and then drop it from
> >>> there if it turns out that we left L2 to enter L1.
> >>>
> >>> VMX and SVM are now identical in how they recover event injections from
> >>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >>> still contains a valid event and, if yes, transfer the content into L1's
> >>> idt_vectoring_info_field.
> >>>
> >>> To avoid incorrectly leaking into the architectural VCPU state an event
> >>> that L1 wants to inject, we skip cancellation on nested run.
> >>>
> >>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>> ---
> >>>
> >>> Survived moderate testing here and (currently) makes sense to me, but
> >>> please review very carefully. I wouldn't be surprised if I'm still
> >>> missing some subtle corner case.
> >>
> >> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
> >> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
> >> injection work. So this patch logically depends on it.
> >>
> > But this patch has hunks from that patch.
> 
> Not mechanically.
> 
What do you mean?

> If you prefer that I merge them together, let me know.
> 
For review it's not necessary; for applying, it would be preferable.

--
			Gleb.
Jan Kiszka Feb. 20, 2013, 4 p.m. UTC | #7
On 2013-02-20 16:57, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 04:51:39PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 16:30, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
>>>> On 2013-02-20 14:01, Jan Kiszka wrote:
>>>>> This aligns VMX more with SVM regarding event injection and recovery for
>>>>> nested guests. The changes allow injecting interrupts directly from L0
>>>>> to L2.
>>>>>
>>>>> One difference to SVM is that we always transfer the pending event
>>>>> injection into the architectural state of the VCPU and then drop it from
>>>>> there if it turns out that we left L2 to enter L1.
>>>>>
>>>>> VMX and SVM are now identical in how they recover event injections from
>>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>>>> still contains a valid event and, if yes, transfer the content into L1's
>>>>> idt_vectoring_info_field.
>>>>>
>>>>> To avoid incorrectly leaking into the architectural VCPU state an event
>>>>> that L1 wants to inject, we skip cancellation on nested run.
>>>>>
>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>> ---
>>>>>
>>>>> Survived moderate testing here and (currently) makes sense to me, but
>>>>> please review very carefully. I wouldn't be surprised if I'm still
>>>>> missing some subtle corner case.
>>>>
>>>> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
>>>> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
>>>> injection work. So this patch logically depends on it.
>>>>
>>> But this patch has hunks from that patch.
>>
>> Not mechanically.
>>
> What do you mean?

You can apply them in arbitrary order; the only result will be minor
offset shifts.

> 
>> If you prefer that I merge them together, let me know.
>>
> For review it's not necessary; for applying, it would be preferable.

OK, will wait for review on this, then send out a combo patch.

Jan
Gleb Natapov Feb. 20, 2013, 4:46 p.m. UTC | #8
On Wed, Feb 20, 2013 at 02:01:47PM +0100, Jan Kiszka wrote:
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow injecting interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.
> 
> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.
> 
> To avoid incorrectly leaking into the architectural VCPU state an event
> that L1 wants to inject, we skip cancellation on nested run.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> 
> Survived moderate testing here and (currently) makes sense to me, but
> please review very carefully. I wouldn't be surprised if I'm still
> missing some subtle corner case.
> 
>  arch/x86/kvm/vmx.c |   57 +++++++++++++++++++++++----------------------------
>  1 files changed, 26 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index dd3a8a0..7d2fbd2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
>  
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> -	if (is_guest_mode(&vmx->vcpu))
> -		return;
>  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>  				  VM_EXIT_INSTRUCTION_LEN,
>  				  IDT_VECTORING_ERROR_CODE);
> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> -	if (is_guest_mode(vcpu))
> +	if (to_vmx(vcpu)->nested.nested_run_pending)
>  		return;
Why is this needed here?

>  	__vmx_complete_interrupts(to_vmx(vcpu),
>  				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> @@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	unsigned long debugctlmsr;
>  
> -	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -		if (vmcs12->idt_vectoring_info_field &
> -				VECTORING_INFO_VALID_MASK) {
> -			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> -				vmcs12->idt_vectoring_info_field);
> -			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> -				vmcs12->vm_exit_instruction_len);
> -			if (vmcs12->idt_vectoring_info_field &
> -					VECTORING_INFO_DELIVER_CODE_MASK)
> -				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> -					vmcs12->idt_vectoring_error_code);
> -		}
> -	}
> -
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>  	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
>  		vmx->entry_time = ktime_get();
> @@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  
>  	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>  
> -	if (is_guest_mode(vcpu)) {
> -		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -		vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> -		if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> -			vmcs12->idt_vectoring_error_code =
> -				vmcs_read32(IDT_VECTORING_ERROR_CODE);
> -			vmcs12->vm_exit_instruction_len =
> -				vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> -		}
> -	}
> -
>  	vmx->loaded_vmcs->launched = 1;
>  
>  	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>  	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> -	/* clear vm-entry fields which are to be cleared on exit */
> -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> +	/* drop what we picked up for L0 via vmx_complete_interrupts */
> +	vcpu->arch.nmi_injected = false;
> +	kvm_clear_exception_queue(vcpu);
> +	kvm_clear_interrupt_queue(vcpu);
> +
> +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> +	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> +		/*
> +		 * Preserve the event that was supposed to be injected
> +		 * by emulating that, had the entry occurred, it would
> +		 * have been reported in IDT_VECTORING_INFO_FIELD.
> +		 */
> +		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> +		    INTR_INFO_VALID_MASK) {
> +			vmcs12->idt_vectoring_info_field =
> +				vmcs12->vm_entry_intr_info_field;
> +			vmcs12->idt_vectoring_error_code =
> +				vmcs12->vm_entry_exception_error_code;
> +			vmcs12->vm_exit_instruction_len =
> +				vmcs12->vm_entry_instruction_len;
> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> +		}
> +
> +		/* clear vm-entry fields which are to be cleared on exit */
>  		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
> +	}
>  }
>  
>  /*
> -- 
> 1.7.3.4

--
			Gleb.
Jan Kiszka Feb. 20, 2013, 4:48 p.m. UTC | #9
On 2013-02-20 17:46, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 02:01:47PM +0100, Jan Kiszka wrote:
>> This aligns VMX more with SVM regarding event injection and recovery for
>> nested guests. The changes allow injecting interrupts directly from L0
>> to L2.
>>
>> One difference to SVM is that we always transfer the pending event
>> injection into the architectural state of the VCPU and then drop it from
>> there if it turns out that we left L2 to enter L1.
>>
>> VMX and SVM are now identical in how they recover event injections from
>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>> still contains a valid event and, if yes, transfer the content into L1's
>> idt_vectoring_info_field.
>>
>> To avoid incorrectly leaking into the architectural VCPU state an event
>> that L1 wants to inject, we skip cancellation on nested run.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>
>> Survived moderate testing here and (currently) makes sense to me, but
>> please review very carefully. I wouldn't be surprised if I'm still
>> missing some subtle corner case.
>>
>>  arch/x86/kvm/vmx.c |   57 +++++++++++++++++++++++----------------------------
>>  1 files changed, 26 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index dd3a8a0..7d2fbd2 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
>>  
>>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>>  {
>> -	if (is_guest_mode(&vmx->vcpu))
>> -		return;
>>  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>>  				  VM_EXIT_INSTRUCTION_LEN,
>>  				  IDT_VECTORING_ERROR_CODE);
>> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>>  
>>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>>  {
>> -	if (is_guest_mode(vcpu))
>> +	if (to_vmx(vcpu)->nested.nested_run_pending)
>>  		return;
> Why is this needed here?

Please check if my reply to Nadav explains this sufficiently.

Jan
Gleb Natapov Feb. 20, 2013, 4:51 p.m. UTC | #10
On Wed, Feb 20, 2013 at 05:48:40PM +0100, Jan Kiszka wrote:
> On 2013-02-20 17:46, Gleb Natapov wrote:
> > On Wed, Feb 20, 2013 at 02:01:47PM +0100, Jan Kiszka wrote:
> >> This aligns VMX more with SVM regarding event injection and recovery for
> >> nested guests. The changes allow injecting interrupts directly from L0
> >> to L2.
> >>
> >> One difference to SVM is that we always transfer the pending event
> >> injection into the architectural state of the VCPU and then drop it from
> >> there if it turns out that we left L2 to enter L1.
> >>
> >> VMX and SVM are now identical in how they recover event injections from
> >> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >> still contains a valid event and, if yes, transfer the content into L1's
> >> idt_vectoring_info_field.
> >>
> >> To avoid incorrectly leaking into the architectural VCPU state an event
> >> that L1 wants to inject, we skip cancellation on nested run.
> >>
> >> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >> ---
> >>
> >> Survived moderate testing here and (currently) makes sense to me, but
> >> please review very carefully. I wouldn't be surprised if I'm still
> >> missing some subtle corner case.
> >>
> >>  arch/x86/kvm/vmx.c |   57 +++++++++++++++++++++++----------------------------
> >>  1 files changed, 26 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index dd3a8a0..7d2fbd2 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
> >>  
> >>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >>  {
> >> -	if (is_guest_mode(&vmx->vcpu))
> >> -		return;
> >>  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> >>  				  VM_EXIT_INSTRUCTION_LEN,
> >>  				  IDT_VECTORING_ERROR_CODE);
> >> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >>  
> >>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> >>  {
> >> -	if (is_guest_mode(vcpu))
> >> +	if (to_vmx(vcpu)->nested.nested_run_pending)
> >>  		return;
> > Why is this needed here?
> 
> Please check if my reply to Nadav explains this sufficiently.
> 
Ah, sorry. Will follow up there if it is not.

--
			Gleb.
Gleb Natapov Feb. 20, 2013, 5:01 p.m. UTC | #11
On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
> On 2013-02-20 15:14, Nadav Har'El wrote:
> > Hi,
> > 
> > By the way, if you haven't seen my description of why the current code
> > did what it did, take a look at
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
> > Another description might also come in handy:
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
> > 
> > On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
> >> This aligns VMX more with SVM regarding event injection and recovery for
> >> nested guests. The changes allow injecting interrupts directly from L0
> >> to L2.
> >>
> >> One difference to SVM is that we always transfer the pending event
> >> injection into the architectural state of the VCPU and then drop it from
> >> there if it turns out that we left L2 to enter L1.
> > 
> > Last time I checked, if I'm remembering correctly, the nested SVM code did
> > something a bit different: After the exit from L2 to L1 and unnecessarily
> > queuing the pending interrupt for injection, it skipped one entry into L1,
> > and as usual after the entry the interrupt queue is cleared so next time
> > around, when L1 is really entered, the wrong injection is not attempted.
> > 
> >> VMX and SVM are now identical in how they recover event injections from
> >> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >> still contains a valid event and, if yes, transfer the content into L1's
> >> idt_vectoring_info_field.
> > 
> >> To avoid incorrectly leaking into the architectural VCPU state an event
> >> that L1 wants to inject, we skip cancellation on nested run.
> > 
> > I didn't understand this last point.
> 
> - prepare_vmcs02 sets event to be injected into L2
> - while trying to enter L2, a cancel condition is met
> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>   into the arch event queues - it's kept in vmcs12
> 
But what if we put it in the arch event queue? It will be reinjected during
the next entry attempt, so nothing bad happens and we have one less if() to
explain. Or am I missing something terrible that will happen?

--
			Gleb.
Jan Kiszka Feb. 20, 2013, 5:24 p.m. UTC | #12
On 2013-02-20 18:01, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 15:14, Nadav Har'El wrote:
>>> Hi,
>>>
>>> By the way, if you haven't seen my description of why the current code
>>> did what it did, take a look at
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
>>> Another description might also come in handy:
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
>>>
>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
>>>> This aligns VMX more with SVM regarding event injection and recovery for
>>>> nested guests. The changes allow injecting interrupts directly from L0
>>>> to L2.
>>>>
>>>> One difference to SVM is that we always transfer the pending event
>>>> injection into the architectural state of the VCPU and then drop it from
>>>> there if it turns out that we left L2 to enter L1.
>>>
>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
>>> something a bit different: After the exit from L2 to L1 and unnecessarily
>>> queuing the pending interrupt for injection, it skipped one entry into L1,
>>> and as usual after the entry the interrupt queue is cleared so next time
>>> around, when L1 is really entered, the wrong injection is not attempted.
>>>
>>>> VMX and SVM are now identical in how they recover event injections from
>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>>> still contains a valid event and, if yes, transfer the content into L1's
>>>> idt_vectoring_info_field.
>>>
>>>> To avoid incorrectly leaking into the architectural VCPU state an event
>>>> that L1 wants to inject, we skip cancellation on nested run.
>>>
>>> I didn't understand this last point.
>>
>> - prepare_vmcs02 sets event to be injected into L2
>> - while trying to enter L2, a cancel condition is met
>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>   into the arch event queues - it's kept in vmcs12
>>
> But what if we put it in the arch event queue? It will be reinjected during
> the next entry attempt, so nothing bad happens and we have one less if() to
> explain. Or am I missing something terrible that will happen?

I started without that if but ran into trouble with KVM-on-KVM (L1
locks up). Let me dig out the instrumentation and check the event flow
again.

Jan
Jan Kiszka Feb. 20, 2013, 5:50 p.m. UTC | #13
On 2013-02-20 18:24, Jan Kiszka wrote:
> On 2013-02-20 18:01, Gleb Natapov wrote:
>> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>>> On 2013-02-20 15:14, Nadav Har'El wrote:
>>>> Hi,
>>>>
>>>> By the way, if you haven't seen my description of why the current code
>>>> did what it did, take a look at
>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
>>>> Another description might also come in handy:
>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
>>>>
>>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
>>>>> This aligns VMX more with SVM regarding event injection and recovery for
>>>>> nested guests. The changes allow injecting interrupts directly from L0
>>>>> to L2.
>>>>>
>>>>> One difference to SVM is that we always transfer the pending event
>>>>> injection into the architectural state of the VCPU and then drop it from
>>>>> there if it turns out that we left L2 to enter L1.
>>>>
>>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
>>>> something a bit different: After the exit from L2 to L1 and unnecessarily
>>>> queuing the pending interrupt for injection, it skipped one entry into L1,
>>>> and as usual after the entry the interrupt queue is cleared so next time
>>>> around, when L1 is really entered, the wrong injection is not attempted.
>>>>
>>>>> VMX and SVM are now identical in how they recover event injections from
>>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>>>> still contains a valid event and, if yes, transfer the content into L1's
>>>>> idt_vectoring_info_field.
>>>>
>>>>> To avoid incorrectly leaking into the architectural VCPU state an event
>>>>> that L1 wants to inject, we skip cancellation on nested run.
>>>>
>>>> I didn't understand this last point.
>>>
>>> - prepare_vmcs02 sets event to be injected into L2
>>> - while trying to enter L2, a cancel condition is met
>>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>>   into the arch event queues - it's kept in vmcs12
>>>
>> But what if we put it in the arch event queue? It will be reinjected during
>> the next entry attempt, so nothing bad happens and we have one less if() to
>> explain. Or am I missing something terrible that will happen?
> 
> I started without that if but ran into trouble with KVM-on-KVM (L1
> locks up). Let me dig out the instrumentation and check the event flow
> again.

OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
the architectural VCPU state, we will also trigger enable_irq_window.
And that raises KVM_REQ_IMMEDIATE_EXIT again as it thinks L0 wants to
inject. That will send us into an endless loop.
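
For reference, the feedback loop comes from this path (approximately;
abridged from enable_irq_window() in vmx.c of that era):

	static void enable_irq_window(struct kvm_vcpu *vcpu)
	{
		u32 cpu_based_vm_exec_control;

		if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
			/*
			 * vmx_interrupt_allowed() said we cannot inject to L1
			 * now because L2 must run. Ask L2 to exit right after
			 * entry, so we can inject to L1 more promptly.
			 */
			kvm_make_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
			return;
		}
		cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
		cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
		vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
	}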

Not sure if we can and should handle this scenario in enable_irq_window
in a nicer way. Open for suggestions.

Jan
Gleb Natapov Feb. 21, 2013, 9:22 a.m. UTC | #14
On Wed, Feb 20, 2013 at 06:50:50PM +0100, Jan Kiszka wrote:
> On 2013-02-20 18:24, Jan Kiszka wrote:
> > On 2013-02-20 18:01, Gleb Natapov wrote:
> >> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
> >>> On 2013-02-20 15:14, Nadav Har'El wrote:
> >>>> Hi,
> >>>>
> >>>> By the way, if you haven't seen my description of why the current code
> >>>> did what it did, take a look at
> >>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
> >>>> Another description might also come in handy:
> >>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
> >>>>
> >>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
> >>>>> This aligns VMX more with SVM regarding event injection and recovery for
> >>>>> nested guests. The changes allow injecting interrupts directly from L0
> >>>>> to L2.
> >>>>>
> >>>>> One difference to SVM is that we always transfer the pending event
> >>>>> injection into the architectural state of the VCPU and then drop it from
> >>>>> there if it turns out that we left L2 to enter L1.
> >>>>
> >>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
> >>>> something a bit different: After the exit from L2 to L1 and unnecessarily
> >>>> queuing the pending interrupt for injection, it skipped one entry into L1,
> >>>> and as usual after the entry the interrupt queue is cleared so next time
> >>>> around, when L1 is really entered, the wrong injection is not attempted.
> >>>>
> >>>>> VMX and SVM are now identical in how they recover event injections from
> >>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >>>>> still contains a valid event and, if yes, transfer the content into L1's
> >>>>> idt_vectoring_info_field.
> >>>>
> >>>>> To avoid incorrectly leaking into the architectural VCPU state an event
> >>>>> that L1 wants to inject, we skip cancellation on nested run.
> >>>>
> >>>> I didn't understand this last point.
> >>>
> >>> - prepare_vmcs02 sets event to be injected into L2
> >>> - while trying to enter L2, a cancel condition is met
> >>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
> >>>   into the arch event queues - it's kept in vmcs12
> >>>
> >>>> But what if we put it in the arch event queue? It will be reinjected during
> >>>> the next entry attempt, so nothing bad happens and we have one less if() to
> >>>> explain. Or am I missing something terrible that will happen?
> > 
> >>> I started without that if but ran into trouble with KVM-on-KVM (L1
> > locks up). Let me dig out the instrumentation and check the event flow
> > again.
> 
> OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
> the architectural VCPU state, we will also trigger enable_irq_window.
> And that raises KVM_REQ_IMMEDIATE_EXIT again as it thinks L0 wants to
> inject. That will send us into an endless loop.
> 
Why would we trigger enable_irq_window()? enable_irq_window() triggers
only if an interrupt is pending in one of the irq chips, not in the architectural
VCPU state.

> Not sure if we can and should handle this scenario in enable_irq_window
> in a nicer way. Open for suggestions.
> 
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux

--
			Gleb.
Jan Kiszka Feb. 21, 2013, 9:43 a.m. UTC | #15
On 2013-02-21 10:22, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 06:50:50PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 18:24, Jan Kiszka wrote:
>>> On 2013-02-20 18:01, Gleb Natapov wrote:
>>>> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>>>>> On 2013-02-20 15:14, Nadav Har'El wrote:
>>>>>> Hi,
>>>>>>
>>>>>> By the way, if you haven't seen my description of why the current code
>>>>>> did what it did, take a look at
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
>>>>>> Another description might also come in handy:
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
>>>>>>
>>>>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
>>>>>>> This aligns VMX more with SVM regarding event injection and recovery for
>>>>>>> nested guests. The changes allow injecting interrupts directly from L0
>>>>>>> to L2.
>>>>>>>
>>>>>>> One difference to SVM is that we always transfer the pending event
>>>>>>> injection into the architectural state of the VCPU and then drop it from
>>>>>>> there if it turns out that we left L2 to enter L1.
>>>>>>
>>>>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
>>>>>> something a bit different: After the exit from L2 to L1 and unnecessarily
>>>>>> queuing the pending interrupt for injection, it skipped one entry into L1,
>>>>>> and as usual after the entry the interrupt queue is cleared so next time
>>>>>> around, when L1 is really entered, the wrong injection is not attempted.
>>>>>>
>>>>>>> VMX and SVM are now identical in how they recover event injections from
>>>>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>>>>>> still contains a valid event and, if yes, transfer the content into L1's
>>>>>>> idt_vectoring_info_field.
>>>>>>
>>>>>>> To avoid incorrectly leaking into the architectural VCPU state an event
>>>>>>> that L1 wants to inject, we skip cancellation on nested run.
>>>>>>
>>>>>> I didn't understand this last point.
>>>>>
>>>>> - prepare_vmcs02 sets event to be injected into L2
>>>>> - while trying to enter L2, a cancel condition is met
>>>>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>>>>   into the arch event queues - it's kept in vmcs12
>>>>>
>>>> But what if we put it in the arch event queue? It will be reinjected during
>>>> the next entry attempt, so nothing bad happens and we have one less if() to
>>>> explain. Or am I missing something terrible that will happen?
>>>
>>> I started without that if but ran into trouble with KVM-on-KVM (L1
>>> locks up). Let me dig out the instrumentation and check the event flow
>>> again.
>>
>> OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
>> the architectural VCPU state, we will also trigger enable_irq_window.
>> And that raises KVM_REQ_IMMEDIATE_EXIT again as it thinks L0 wants to
>> inject. That will send us into an endless loop.
>>
> Why would we trigger enable_irq_window()? enable_irq_window() triggers
> only if an interrupt is pending in one of the irq chips, not in the architectural
> VCPU state.

Precisely this is the case if an IRQ for L1 arrived while we tried to
enter L2 and caused the cancellation above.

Jan
Gleb Natapov Feb. 21, 2013, 10:06 a.m. UTC | #16
On Thu, Feb 21, 2013 at 10:43:57AM +0100, Jan Kiszka wrote:
> On 2013-02-21 10:22, Gleb Natapov wrote:
> > On Wed, Feb 20, 2013 at 06:50:50PM +0100, Jan Kiszka wrote:
> >> On 2013-02-20 18:24, Jan Kiszka wrote:
> >>> On 2013-02-20 18:01, Gleb Natapov wrote:
> >>>> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
> >>>>> On 2013-02-20 15:14, Nadav Har'El wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> By the way, if you haven't seen my description of why the current code
> >>>>>> did what it did, take a look at
> >>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
> >>>>>> Another description might also come in handy:
> >>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
> >>>>>>
> >>>>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
> >>>>>>> This aligns VMX more with SVM regarding event injection and recovery for
> >>>>>>> nested guests. The changes allow injecting interrupts directly from L0
> >>>>>>> to L2.
> >>>>>>>
> >>>>>>> One difference to SVM is that we always transfer the pending event
> >>>>>>> injection into the architectural state of the VCPU and then drop it from
> >>>>>>> there if it turns out that we left L2 to enter L1.
> >>>>>>
> >>>>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
> >>>>>> something a bit different: After the exit from L2 to L1 and unnecessarily
> >>>>>> queuing the pending interrupt for injection, it skipped one entry into L1,
> >>>>>> and as usual after the entry the interrupt queue is cleared so next time
> >>>>>> around, when L1 is really entered, the wrong injection is not attempted.
> >>>>>>
> >>>>>>> VMX and SVM are now identical in how they recover event injections from
> >>>>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >>>>>>> still contains a valid event and, if yes, transfer the content into L1's
> >>>>>>> idt_vectoring_info_field.
> >>>>>>
> >>>>>>> To avoid incorrectly leaking into the architectural VCPU state an event
> >>>>>>> that L1 wants to inject, we skip cancellation on nested run.
> >>>>>>
> >>>>>> I didn't understand this last point.
> >>>>>
> >>>>> - prepare_vmcs02 sets event to be injected into L2
> >>>>> - while trying to enter L2, a cancel condition is met
> >>>>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
> >>>>>   into the arch event queues - it's kept in vmcs12
> >>>>>
> >>>> But what if we put it in the arch event queue? It will be reinjected during
> >>>> the next entry attempt, so nothing bad happens and we have one less if() to
> >>>> explain. Or am I missing something terrible that will happen?
> >>>
> >>> I started without that if but ran into trouble with KVM-on-KVM (L1
> >>> locks up). Let me dig out the instrumentation and check the event flow
> >>> again.
> >>
> >> OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
> >> the architectural VCPU state, we will also trigger enable_irq_window.
> >> And that raises KVM_REQ_IMMEDIATE_EXIT again as it thinks L0 wants to
> >> inject. That will send us into an endless loop.
> >>
> > Why would we trigger enable_irq_window()? enable_irq_window() triggers
> > only if an interrupt is pending in one of the irq chips, not in the architectural
> > VCPU state.
> 
> Precisely this is the case if an IRQ for L1 arrived while we tried to
> enter L2 and caused the cancellation above.
> 
But during the next entry the cancelled interrupt is transferred
from the architectural VCPU state to VM_ENTRY_INTR_INFO_FIELD by
inject_pending_event()->vmx_inject_irq(), so at the point where
enable_irq_window() is called the state is exactly the same no matter
whether we canceled the interrupt or not during the previous entry attempt.
What am I missing? Oh, maybe I am missing that if we do not cancel the
interrupt then inject_pending_event() will skip
  if (vcpu->arch.interrupt.pending)
    ....
and will inject the interrupt from the APIC that caused the cancellation of
the previous entry, but then this is a bug, since this new interrupt will
overwrite the one that is still in VM_ENTRY_INTR_INFO_FIELD from the
previous entry attempt and there may be another pending interrupt in the
APIC anyway that will cause enable_irq_window() too.
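
For reference, the ordering described above, abridged from
inject_pending_event() in x86.c of that era (simplified sketch; argument
lists elided):

	static void inject_pending_event(struct kvm_vcpu *vcpu)
	{
		/* reinject events left over from a previous (attempted) entry */
		if (vcpu->arch.exception.pending) {
			kvm_x86_ops->queue_exception(vcpu, ...);
			return;
		}
		if (vcpu->arch.nmi_injected) {
			kvm_x86_ops->set_nmi(vcpu);
			return;
		}
		if (vcpu->arch.interrupt.pending) {
			kvm_x86_ops->set_irq(vcpu);	/* canceled event goes back in */
			return;
		}

		/* only then consider new events from the (virtual) irq chips */
		if (vcpu->arch.nmi_pending) {
			...
		} else if (kvm_cpu_has_interrupt(vcpu) &&
			   kvm_x86_ops->interrupt_allowed(vcpu)) {
			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
					    false);
			kvm_x86_ops->set_irq(vcpu);
		}
	}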

--
			Gleb.
Jan Kiszka Feb. 21, 2013, 10:18 a.m. UTC | #17
On 2013-02-21 11:06, Gleb Natapov wrote:
> On Thu, Feb 21, 2013 at 10:43:57AM +0100, Jan Kiszka wrote:
>> On 2013-02-21 10:22, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 06:50:50PM +0100, Jan Kiszka wrote:
>>>> On 2013-02-20 18:24, Jan Kiszka wrote:
>>>>> On 2013-02-20 18:01, Gleb Natapov wrote:
>>>>>> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>>>>>>> On 2013-02-20 15:14, Nadav Har'El wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> By the way, if you haven't seen my description of why the current code
>>>>>>>> did what it did, take a look at
>>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
>>>>>>>> Another description might also come in handy:
>>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
>>>>>>>>
>>>>>>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
>>>>>>>>> This aligns VMX more with SVM regarding event injection and recovery for
>>>>>>>>> nested guests. The changes allow injecting interrupts directly from L0
>>>>>>>>> to L2.
>>>>>>>>>
>>>>>>>>> One difference to SVM is that we always transfer the pending event
>>>>>>>>> injection into the architectural state of the VCPU and then drop it from
>>>>>>>>> there if it turns out that we left L2 to enter L1.
>>>>>>>>
>>>>>>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
>>>>>>>> something a bit different: After the exit from L2 to L1 and unnecessarily
>>>>>>>> queuing the pending interrupt for injection, it skipped one entry into L1,
>>>>>>>> and as usual after the entry the interrupt queue is cleared so next time
>>>>>>>> around, when L1 one is really entered, the wrong injection is not attempted.
>>>>>>>>
>>>>>>>>> VMX and SVM are now identical in how they recover event injections from
>>>>>>>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>>>>>>>> still contains a valid event and, if yes, transfer the content into L1's
>>>>>>>>> idt_vectoring_info_field.
>>>>>>>>
>>>>>>>>> To avoid that we incorrectly leak an event into the architectural VCPU
>>>>>>>>> state that L1 wants to inject, we skip cancellation on nested run.
>>>>>>>>
>>>>>>>> I didn't understand this last point.
>>>>>>>
>>>>>>> - prepare_vmcs02 sets event to be injected into L2
>>>>>>> - while trying to enter L2, a cancel condition is met
>>>>>>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>>>>>>   into the arch event queues - it's kept in vmcs12
>>>>>>>
>>>>>> But what if we put it in arch event queue? It will be reinjected during
>>>>>> next entry attempt, so nothing bad happens and we have one less if() to explain,
>>>>>> or do I miss something terrible that will happen?
>>>>>
>>>>> I started without that if but ran into troubles with KVM-on-KVM (L1
>>>>> locks up). Let me dig out the instrumentation and check the event flow
>>>>> again.
>>>>
>>>> OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
>>>> the architectural VCPU state, we will also trigger enable_irq_window.
>>>> And that raises KVM_REQ_IMMEDIATE_EXIT again as it thinks L0 wants to
>>>> inject. That will send us into an endless loop.
>>>>
>>> Why would we trigger enable_irq_window()? enable_irq_window() triggers
>>> only if an interrupt is pending in one of the irq chips, not in the
>>> architectural VCPU state.
>>
>> Precisely this is the case if an IRQ for L1 arrived while we tried to
>> enter L2 and caused the cancellation above.
>>
> But during the next entry the cancelled interrupt is transferred
> from the architectural VCPU state to VM_ENTRY_INTR_INFO_FIELD by
> inject_pending_event()->vmx_inject_irq(), so at the point where
> enable_irq_window() is called the state is exactly the same no matter
> whether we cancelled the interrupt or not during the previous entry
> attempt. What am I missing?

Maybe that we normally either have an external IRQ pending in some IRQ
chip or in the VCPU architectural state, not both at the same time? By
transferring something that doesn't come from a virtual IRQ chip of L0
(but from the one in L1) into the architectural state, we break this
assumption.
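
For reference, the window check in vcpu_enter_guest() is roughly this
(paraphrased from arch/x86/kvm/x86.c, from memory):

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		inject_pending_event(vcpu);

		/* enable NMI/IRQ window open exits if needed */
		if (vcpu->arch.nmi_pending)
			kvm_x86_ops->enable_nmi_window(vcpu);
		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
			kvm_x86_ops->enable_irq_window(vcpu);
	}

kvm_cpu_has_interrupt() only looks at L0's irq chips, so the IRQ that
arrived for L1 keeps triggering the window request (and, in guest mode,
KVM_REQ_IMMEDIATE_EXIT).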

> Oh, maybe I am missing that if we do not cancel the interrupt
> then inject_pending_event() will skip
>   if (vcpu->arch.interrupt.pending)
>     ....

If we do not cancel, we will not inject at all (due to missing
KVM_REQ_EVENT).
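
For context: __vmx_complete_interrupts(), which the cancellation runs
through, is what raises KVM_REQ_EVENT for the transferred event.
Roughly, as a paraphrased sketch from arch/x86/kvm/vmx.c:

	vmx->vcpu.arch.nmi_injected = false;
	kvm_clear_exception_queue(&vmx->vcpu);
	kvm_clear_interrupt_queue(&vmx->vcpu);

	if (!(idt_vectoring_info & VECTORING_INFO_VALID_MASK))
		return;

	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);

	switch (type) {
	...
	case INTR_TYPE_EXT_INTR:
		kvm_queue_interrupt(&vmx->vcpu, vector, false);
		break;
	...
	}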

> and will inject the interrupt from the APIC that caused the cancellation
> of the previous entry. But then this is a bug, since this new interrupt
> will overwrite the one that is still in VM_ENTRY_INTR_INFO_FIELD from
> the previous entry attempt, and there may be another pending interrupt
> in the APIC anyway that will trigger enable_irq_window() too.

Maybe the issue is that we do not properly simulate a VMEXIT on an
external interrupt during vmrun (like SVM does). Need to check for this
case again...
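
(For comparison, the SVM side requests that vmexit in nested_svm_intr();
roughly, paraphrased from arch/x86/kvm/svm.c and from memory:

	if (svm->nested.intercept & 1ULL << INTERCEPT_INTR) {
		svm->vmcb->control.exit_code   = SVM_EXIT_INTR;
		svm->vmcb->control.exit_info_1 = 0;
		svm->vmcb->control.exit_info_2 = 0;
		/* the #vmexit emulation may sleep, so only request it */
		svm->nested.exit_required = true;
		return false;
	}

i.e. the #vmexit to L1 is requested before the external interrupt is
ever queued for L2.)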

Jan
Jan Kiszka Feb. 21, 2013, 10:28 a.m. UTC | #18
On 2013-02-21 11:18, Jan Kiszka wrote:
> [...]
> Maybe the issue is that we do not properly simulate a VMEXIT on an
> external interrupt during vmrun (like SVM does). Need to check for this
> case again...

static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
{
	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
		if (to_vmx(vcpu)->nested.nested_run_pending ||
		    (vmcs12->idt_vectoring_info_field &
		     VECTORING_INFO_VALID_MASK))
			return 0;
		nested_vmx_vmexit(vcpu);
		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
		vmcs12->vm_exit_intr_info = 0;
		...

I do not understand ATM why we refuse to simulate a vmexit due to an
external interrupt when we are about to run L2 or have something in
idt_vectoring_info_field. The external interrupt would not overwrite
idt_vectoring_info_field but should end up in vm_exit_intr_info.

Jan
Jan Kiszka Feb. 21, 2013, 10:33 a.m. UTC | #19
On 2013-02-21 11:28, Jan Kiszka wrote:
> [...]
> I do not understand ATM why we refuse to simulate a vmexit due to an
> external interrupt when we are about to run L2 or have something in
> idt_vectoring_info_field. The external interrupt would not overwrite
> idt_vectoring_info_field but should end up in vm_exit_intr_info.

Explained in 51cfe38ea5: idt_vectoring_info_field and vm_exit_intr_info
must not be valid at the same time.

Jan
Gleb Natapov Feb. 21, 2013, 1:13 p.m. UTC | #20
On Thu, Feb 21, 2013 at 11:33:30AM +0100, Jan Kiszka wrote:
> [...]
> > static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
> > {
> > 	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
> > 		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > 		if (to_vmx(vcpu)->nested.nested_run_pending ||
> > 		    (vmcs12->idt_vectoring_info_field &
> > 		     VECTORING_INFO_VALID_MASK))
> > 			return 0;
> > 		nested_vmx_vmexit(vcpu);
> > 		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> > 		vmcs12->vm_exit_intr_info = 0;
> > 		...
> > 
> > I do not understand ATM why we refuse to simulate a vmexit due to an
> > external interrupt when we are about to run L2 or have something in
> > idt_vectoring_info_field. The external interrupt would not overwrite
> > idt_vectoring_info_field but should end up in vm_exit_intr_info.
> 
> Explained in 51cfe38ea5: idt_vectoring_info_field and vm_exit_intr_info
> must not be valid at the same time.
> 
Interestingly, if we transfer the interrupt from idt_vectoring_info into
the arch VCPU state, we can drop this check because vmx_interrupt_allowed()
will not be called while there is an event to reinject. 51cfe38ea5 still
does not explain why nested_run_pending is needed. We cannot #vmexit
without entering L2, but we can undo the VMLAUNCH/VMRESUME emulation,
leaving rip pointing to the instruction. We can start by moving
skip_emulated_instruction() from nested_vmx_run() to nested_vmx_vmexit().
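
Something along these lines, as an untested sketch:

	static void nested_vmx_vmexit(struct kvm_vcpu *vcpu)
	{
		...
		leave_guest_mode(vcpu);
		...
		/*
		 * The switch back to L1 is committed now, so consume the
		 * VMLAUNCH/VMRESUME instruction only at this point.
		 */
		skip_emulated_instruction(vcpu);
	}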

--
			Gleb.
Jan Kiszka Feb. 21, 2013, 1:22 p.m. UTC | #21
On 2013-02-21 14:13, Gleb Natapov wrote:
> [...]
> Interestingly, if we transfer the interrupt from idt_vectoring_info into
> the arch VCPU state, we can drop this check because vmx_interrupt_allowed()
> will not be called while there is an event to reinject. 51cfe38ea5 still
> does not explain why nested_run_pending is needed. We cannot #vmexit
> without entering L2, but we can undo the VMLAUNCH/VMRESUME emulation,
> leaving rip pointing to the instruction. We can start by moving
> skip_emulated_instruction() from nested_vmx_run() to nested_vmx_vmexit().

That generally does not help to inject/report an external IRQ to L1, as
L1 runs with IRQs disabled around VMLAUNCH/VMRESUME. Thus, the only way
to report this IRQ is a VMEXIT. I think the ordering is strict: first
inject what L1 wants to send to L2, then VMEXIT with that external IRQ
in VM_EXIT_INTR_INFO.

Jan
Nadav Har'El Feb. 21, 2013, 1:28 p.m. UTC | #22
On Thu, Feb 21, 2013, Gleb Natapov wrote about "Re: [PATCH] KVM: nVMX: Rework event injection and recovery":
> will not be called while there is an event to reinject. 51cfe38ea5 still
> does not explain why nested_run_pending is needed. We cannot #vmexit
> without entering L2, but we can undo the VMLAUNCH/VMRESUME emulation,
> leaving rip pointing to the instruction. We can start by moving
> skip_emulated_instruction() from nested_vmx_run() to nested_vmx_vmexit().

This is a very interesting idea!
Don't forget to also call skip_emulated_instruction() in
nested_vmx_entry_failure().

And please expand the comment at the end of nested_vmx_run(), saying
that skipping the instruction is also done on exit, unless the
instruction needs to be retried because we needed to inject an
interrupt into L1 before running it.

Whether this is actually clearer than the "nested_run_pending" approach
I don't know.
Nadav Har'El Feb. 21, 2013, 1:37 p.m. UTC | #23
On Thu, Feb 21, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Rework event injection and recovery":
> That generally does not help to inject/report an external IRQ to L1, as
> L1 runs with IRQs disabled around VMLAUNCH/VMRESUME.

Good point, I forgot that :(

So it looks like nested_run_pending was necessary, after all.
Gleb Natapov Feb. 21, 2013, 1:45 p.m. UTC | #24
On Thu, Feb 21, 2013 at 03:37:16PM +0200, Nadav Har'El wrote:
> On Thu, Feb 21, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Rework event injection and recovery":
> > That generally does not help to inject/report an external IRQ to L1, as
> > L1 runs with IRQs disabled around VMLAUNCH/VMRESUME.
> 
> Good point, I forgot that :(
> 
> So it looks like nested_run_pending was necessary, after all.
> 
Not sure (as in "this is an implementation detail that is possible to
avoid", not as in "this check here is incorrect!" :)). If interrupts
are disabled, then vmx_interrupt_allowed() should return false because
interrupts are disabled, not because we are emulating a guest entry.
This is easier said than done, but I'll think about it. Looks like SVM
does it this way.
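
For reference, svm_interrupt_allowed() looks roughly like this
(paraphrased from arch/x86/kvm/svm.c, from memory):

	static int svm_interrupt_allowed(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);
		struct vmcb *vmcb = svm->vmcb;
		int ret;

		if (!gif_set(svm) ||
		    (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK))
			return 0;

		ret = !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF);

		if (is_guest_mode(vcpu))
			return ret && !(svm->vcpu.arch.hflags & HF_VINTR_MASK);

		return ret;
	}

The kvm_get_rflags() check makes the guest's IF state the deciding
factor, not the phase of the VMRUN emulation.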

--
			Gleb.
diff mbox

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dd3a8a0..7d2fbd2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6489,8 +6489,6 @@  static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
-	if (is_guest_mode(&vmx->vcpu))
-		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -6498,7 +6496,7 @@  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu))
+	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -6531,21 +6529,6 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long debugctlmsr;
 
-	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
-		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-		if (vmcs12->idt_vectoring_info_field &
-				VECTORING_INFO_VALID_MASK) {
-			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-				vmcs12->idt_vectoring_info_field);
-			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
-				vmcs12->vm_exit_instruction_len);
-			if (vmcs12->idt_vectoring_info_field &
-					VECTORING_INFO_DELIVER_CODE_MASK)
-				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
-					vmcs12->idt_vectoring_error_code);
-		}
-	}
-
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -6704,17 +6687,6 @@  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
-	if (is_guest_mode(vcpu)) {
-		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-		vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
-		if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
-			vmcs12->idt_vectoring_error_code =
-				vmcs_read32(IDT_VECTORING_ERROR_CODE);
-			vmcs12->vm_exit_instruction_len =
-				vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
-		}
-	}
-
 	vmx->loaded_vmcs->launched = 1;
 
 	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
@@ -7403,9 +7375,32 @@  void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
 	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 
-	/* clear vm-entry fields which are to be cleared on exit */
-	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+	/* drop what we picked up for L0 via vmx_complete_interrupts */
+	vcpu->arch.nmi_injected = false;
+	kvm_clear_exception_queue(vcpu);
+	kvm_clear_interrupt_queue(vcpu);
+
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
+	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
+		/*
+		 * Preserve the event that was supposed to be injected:
+		 * had the vmentry been performed, it would have been
+		 * returned in IDT_VECTORING_INFO_FIELD on exit.
+		 */
+		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
+		    INTR_INFO_VALID_MASK) {
+			vmcs12->idt_vectoring_info_field =
+				vmcs12->vm_entry_intr_info_field;
+			vmcs12->idt_vectoring_error_code =
+				vmcs12->vm_entry_exception_error_code;
+			vmcs12->vm_exit_instruction_len =
+				vmcs12->vm_entry_instruction_len;
+			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
+		}
+
+		/* clear vm-entry fields which are to be cleared on exit */
 		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+	}
 }
 
 /*