Message ID | 1404284054-51863-1-git-send-email-wanpeng.li@linux.intel.com (mailing list archive)
---|---
State | New, archived
> -----Original Message----- > From: Wanpeng Li [mailto:wanpeng.li@linux.intel.com] > Sent: Wednesday, July 2, 2014 2:54 PM > To: Paolo Bonzini; Jan Kiszka; Gleb Natapov > Cc: Hu, Robert; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Wanpeng Li > Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race > > This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 > > If we didn't inject a still-pending event to L1 since nested_run_pending, > KVM_REQ_EVENT should be requested after the vmexit in order to inject the > event to L1. However, current log blindly request a KVM_REQ_EVENT even if > there is no still-pending event to L1 which blocked by nested_run_pending. > There is a race which lead to an interrupt will be injected to L2 which > belong to L1 if L0 send an interrupt to L1 during this window. > > VCPU0 another thread > > L1 intr not blocked on L2 first entry > vmx_vcpu_run req event > kvm check request req event > check_nested_events don't have any intr > not nested exit > intr occur (8254, lapic timer > etc) > inject_pending_event now have intr > inject interrupt > > This patch fix this race by introduced a l1_events_blocked field in nested_vmx > which indicates there is still-pending event which blocked by > nested_run_pending, > and smart request a KVM_REQ_EVENT if there is a still-pending event which > blocked > by nested_run_pending. > > Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Tested-by: Robert Hu<robert.hu@intel.com> > --- > arch/x86/kvm/vmx.c | 20 +++++++++++++++----- > 1 file changed, 15 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index f4e5aed..fe69c49 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -372,6 +372,7 @@ struct nested_vmx { > u64 vmcs01_tsc_offset; > /* L2 must run next, and mustn't decide to exit to L1. */ > bool nested_run_pending; > + bool l1_events_blocked; > /* > * Guest pages referred to in vmcs02 with host-physical pointers, so > * we must keep them pinned while L2 runs. > @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu > *vcpu) > * we did not inject a still-pending event to L1 now because of > * nested_run_pending, we need to re-enable this bit. 
> */ > - if (vmx->nested.nested_run_pending) > + if (to_vmx(vcpu)->nested.l1_events_blocked) { > + to_vmx(vcpu)->nested.l1_events_blocked = false; > kvm_make_request(KVM_REQ_EVENT, vcpu); > + } > > vmx->nested.nested_run_pending = 0; > > @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct > kvm_vcpu *vcpu, bool external_intr) > > if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && > vmx->nested.preemption_timer_expired) { > - if (vmx->nested.nested_run_pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > return -EBUSY; > + } > nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); > return 0; > } > > if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) { > - if (vmx->nested.nested_run_pending || > - vcpu->arch.interrupt.pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > + return -EBUSY; > + } > + if (vcpu->arch.interrupt.pending) > return -EBUSY; > nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, > NMI_VECTOR | INTR_TYPE_NMI_INTR | > @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu > *vcpu, bool external_intr) > > if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && > nested_exit_on_intr(vcpu)) { > - if (vmx->nested.nested_run_pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > return -EBUSY; > + } > nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, > 0); > } > > -- > 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
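For readers following the race diagram in the commit message above, the L0 entry path it describes looks roughly like the sketch below. This is a deliberately simplified illustration assembled from the functions named in the thread (kvm_check_request(), check_nested_events(), inject_pending_event()); it is not the actual arch/x86/kvm/x86.c code, and the direct call to check_nested_events() stands in for the kvm_x86_ops callback.

```c
/*
 * Simplified sketch of the L0 vcpu entry path the race diagram refers
 * to -- illustration only, not the real arch/x86/kvm/x86.c code.
 */
static void vcpu_enter_guest_sketch(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_EVENT, vcpu)) {
		/*
		 * While running L2 (guest mode), first ask whether a
		 * pending L1 event requires a nested vmexit.  In the racy
		 * case nothing is pending yet, so no vmexit is forced.
		 */
		if (is_guest_mode(vcpu))
			check_nested_events(vcpu);  /* stands in for kvm_x86_ops->check_nested_events */

		/*
		 * Window: another thread (8254, LAPIC timer, ...) makes an
		 * interrupt for L1 pending right here.
		 *
		 * Now inject whatever is pending.  Because the vcpu is
		 * still in guest mode, the interrupt meant for L1 ends up
		 * injected into L2.
		 */
		inject_pending_event(vcpu);
	}

	/* ... vmentry into L2 ... */
}
```

With the patch above, the KVM_REQ_EVENT that re-opens this window is requested only when check_nested_events() actually had to defer an L1 event (l1_events_blocked), rather than unconditionally whenever nested_run_pending was set.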
On 2014-07-02 08:54, Wanpeng Li wrote: > This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 > > If we didn't inject a still-pending event to L1 since nested_run_pending, > KVM_REQ_EVENT should be requested after the vmexit in order to inject the > event to L1. However, current log blindly request a KVM_REQ_EVENT even if > there is no still-pending event to L1 which blocked by nested_run_pending. > There is a race which lead to an interrupt will be injected to L2 which > belong to L1 if L0 send an interrupt to L1 during this window. > > VCPU0 another thread > > L1 intr not blocked on L2 first entry > vmx_vcpu_run req event > kvm check request req event > check_nested_events don't have any intr > not nested exit > intr occur (8254, lapic timer etc) > inject_pending_event now have intr > inject interrupt > > This patch fix this race by introduced a l1_events_blocked field in nested_vmx > which indicates there is still-pending event which blocked by nested_run_pending, > and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked > by nested_run_pending. There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why aren't those able to trigger this scenario? In any case, unconditionally setting KVM_REQ_EVENT seems strange and should be changed. Jan > > Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> > --- > arch/x86/kvm/vmx.c | 20 +++++++++++++++----- > 1 file changed, 15 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index f4e5aed..fe69c49 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -372,6 +372,7 @@ struct nested_vmx { > u64 vmcs01_tsc_offset; > /* L2 must run next, and mustn't decide to exit to L1. */ > bool nested_run_pending; > + bool l1_events_blocked; > /* > * Guest pages referred to in vmcs02 with host-physical pointers, so > * we must keep them pinned while L2 runs. > @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) > * we did not inject a still-pending event to L1 now because of > * nested_run_pending, we need to re-enable this bit. 
> */ > - if (vmx->nested.nested_run_pending) > + if (to_vmx(vcpu)->nested.l1_events_blocked) { > + to_vmx(vcpu)->nested.l1_events_blocked = false; > kvm_make_request(KVM_REQ_EVENT, vcpu); > + } > > vmx->nested.nested_run_pending = 0; > > @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) > > if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && > vmx->nested.preemption_timer_expired) { > - if (vmx->nested.nested_run_pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > return -EBUSY; > + } > nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); > return 0; > } > > if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) { > - if (vmx->nested.nested_run_pending || > - vcpu->arch.interrupt.pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > + return -EBUSY; > + } > + if (vcpu->arch.interrupt.pending) > return -EBUSY; > nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, > NMI_VECTOR | INTR_TYPE_NMI_INTR | > @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) > > if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && > nested_exit_on_intr(vcpu)) { > - if (vmx->nested.nested_run_pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > return -EBUSY; > + } > nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); > } > >
On 2014-07-02 09:20, Hu, Robert wrote: >> -----Original Message----- >> From: Wanpeng Li [mailto:wanpeng.li@linux.intel.com] >> Sent: Wednesday, July 2, 2014 2:54 PM >> To: Paolo Bonzini; Jan Kiszka; Gleb Natapov >> Cc: Hu, Robert; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Wanpeng Li >> Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race >> >> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >> >> If we didn't inject a still-pending event to L1 since nested_run_pending, >> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >> there is no still-pending event to L1 which blocked by nested_run_pending. >> There is a race which lead to an interrupt will be injected to L2 which >> belong to L1 if L0 send an interrupt to L1 during this window. >> >> VCPU0 another thread >> >> L1 intr not blocked on L2 first entry >> vmx_vcpu_run req event >> kvm check request req event >> check_nested_events don't have any intr >> not nested exit >> intr occur (8254, lapic timer >> etc) >> inject_pending_event now have intr >> inject interrupt >> >> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >> which indicates there is still-pending event which blocked by >> nested_run_pending, >> and smart request a KVM_REQ_EVENT if there is a still-pending event which >> blocked >> by nested_run_pending. >> >> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> > Tested-by: Robert Hu<robert.hu@intel.com> Do you happen to have a kvm-unit-test for this race? Or how did you test? Just curious, and if there is a test, it would be good to integrate it. Jan
> -----Original Message----- > From: Jan Kiszka [mailto:jan.kiszka@siemens.com] > Sent: Wednesday, July 2, 2014 5:03 PM > To: Hu, Robert; Wanpeng Li; Paolo Bonzini; Gleb Natapov > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since > race > > On 2014-07-02 09:20, Hu, Robert wrote: > >> -----Original Message----- > >> From: Wanpeng Li [mailto:wanpeng.li@linux.intel.com] > >> Sent: Wednesday, July 2, 2014 2:54 PM > >> To: Paolo Bonzini; Jan Kiszka; Gleb Natapov > >> Cc: Hu, Robert; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Wanpeng > Li > >> Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since > race > >> > >> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 > >> > >> If we didn't inject a still-pending event to L1 since nested_run_pending, > >> KVM_REQ_EVENT should be requested after the vmexit in order to inject the > >> event to L1. However, current log blindly request a KVM_REQ_EVENT even if > >> there is no still-pending event to L1 which blocked by nested_run_pending. > >> There is a race which lead to an interrupt will be injected to L2 which > >> belong to L1 if L0 send an interrupt to L1 during this window. > >> > >> VCPU0 another thread > >> > >> L1 intr not blocked on L2 first entry > >> vmx_vcpu_run req event > >> kvm check request req event > >> check_nested_events don't have any intr > >> not nested exit > >> intr occur (8254, lapic timer > >> etc) > >> inject_pending_event now have intr > >> inject interrupt > >> > >> This patch fix this race by introduced a l1_events_blocked field in nested_vmx > >> which indicates there is still-pending event which blocked by > >> nested_run_pending, > >> and smart request a KVM_REQ_EVENT if there is a still-pending event which > >> blocked > >> by nested_run_pending. > >> > >> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> > > Tested-by: Robert Hu<robert.hu@intel.com> > > Do you happen to have a kvm-unit-test for this race? Or how did you > test? Just curious, and if there is a test, it would be good to > integrate it. I just apply the patch and test the reported scenario. > > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SES-DE > Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2014-07-02 11:13, Hu, Robert wrote: > >> -----Original Message----- >> From: Jan Kiszka [mailto:jan.kiszka@siemens.com] >> Sent: Wednesday, July 2, 2014 5:03 PM >> To: Hu, Robert; Wanpeng Li; Paolo Bonzini; Gleb Natapov >> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org >> Subject: Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since >> race >> >> On 2014-07-02 09:20, Hu, Robert wrote: >>>> -----Original Message----- >>>> From: Wanpeng Li [mailto:wanpeng.li@linux.intel.com] >>>> Sent: Wednesday, July 2, 2014 2:54 PM >>>> To: Paolo Bonzini; Jan Kiszka; Gleb Natapov >>>> Cc: Hu, Robert; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Wanpeng >> Li >>>> Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since >> race >>>> >>>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>>> >>>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >>>> there is no still-pending event to L1 which blocked by nested_run_pending. >>>> There is a race which lead to an interrupt will be injected to L2 which >>>> belong to L1 if L0 send an interrupt to L1 during this window. >>>> >>>> VCPU0 another thread >>>> >>>> L1 intr not blocked on L2 first entry >>>> vmx_vcpu_run req event >>>> kvm check request req event >>>> check_nested_events don't have any intr >>>> not nested exit >>>> intr occur (8254, lapic timer >>>> etc) >>>> inject_pending_event now have intr >>>> inject interrupt >>>> >>>> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >>>> which indicates there is still-pending event which blocked by >>>> nested_run_pending, >>>> and smart request a KVM_REQ_EVENT if there is a still-pending event which >>>> blocked >>>> by nested_run_pending. >>>> >>>> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> >>> Tested-by: Robert Hu<robert.hu@intel.com> >> >> Do you happen to have a kvm-unit-test for this race? Or how did you >> test? Just curious, and if there is a test, it would be good to >> integrate it. > I just apply the patch and test the reported scenario. Ah, sorry, missed the referenced bug report. Jan
Wanpeng Li <wanpeng.li@linux.intel.com> writes: > This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 I can also reproduce this easily with Linux as L1 by "slowing it down" eg. running with ept = 0 I suggest changing the subject to - KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 > If we didn't inject a still-pending event to L1 since nested_run_pending, > KVM_REQ_EVENT should be requested after the vmexit in order to inject the > event to L1. However, current log blindly request a KVM_REQ_EVENT even if What's current "log" ? Do you mean current "code" ? > there is no still-pending event to L1 which blocked by nested_run_pending. > There is a race which lead to an interrupt will be injected to L2 which > belong to L1 if L0 send an interrupt to L1 during this window. > > VCPU0 another thread > > L1 intr not blocked on L2 first entry > vmx_vcpu_run req event > kvm check request req event > check_nested_events don't have any intr > not nested exit > intr occur (8254, lapic timer etc) > inject_pending_event now have intr > inject interrupt > > This patch fix this race by introduced a l1_events_blocked field in nested_vmx > which indicates there is still-pending event which blocked by nested_run_pending, > and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked > by nested_run_pending. > > Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> > --- > arch/x86/kvm/vmx.c | 20 +++++++++++++++----- > 1 file changed, 15 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index f4e5aed..fe69c49 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -372,6 +372,7 @@ struct nested_vmx { > u64 vmcs01_tsc_offset; > /* L2 must run next, and mustn't decide to exit to L1. */ > bool nested_run_pending; > + bool l1_events_blocked; > /* > * Guest pages referred to in vmcs02 with host-physical pointers, so > * we must keep them pinned while L2 runs. > @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) > * we did not inject a still-pending event to L1 now because of > * nested_run_pending, we need to re-enable this bit. > */ > - if (vmx->nested.nested_run_pending) > + if (to_vmx(vcpu)->nested.l1_events_blocked) { Is to_vmx() necessary since we alredy have the vmx pointer ? 
> + to_vmx(vcpu)->nested.l1_events_blocked = false; > kvm_make_request(KVM_REQ_EVENT, vcpu); > + } > > vmx->nested.nested_run_pending = 0; > > @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) > > if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && > vmx->nested.preemption_timer_expired) { > - if (vmx->nested.nested_run_pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > return -EBUSY; > + } > nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); > return 0; > } > > if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) { > - if (vmx->nested.nested_run_pending || > - vcpu->arch.interrupt.pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > + return -EBUSY; > + } > + if (vcpu->arch.interrupt.pending) > return -EBUSY; > nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, > NMI_VECTOR | INTR_TYPE_NMI_INTR | > @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) > > if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && > nested_exit_on_intr(vcpu)) { > - if (vmx->nested.nested_run_pending) > + if (vmx->nested.nested_run_pending) { > + vmx->nested.l1_events_blocked = true; > return -EBUSY; > + } > nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); > } Also, I am wondering isn't it enough to just do this to avoid this race ? static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) { - return (!to_vmx(vcpu)->nested.nested_run_pending && + return (!is_guest_mode(vcpu) && + !to_vmx(vcpu)->nested.nested_run_pending && vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & Thanks, Bandan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
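For reference, Bandan's suggested alternative is shown below as a complete function rather than a truncated diff. The first lines come straight from the quoted fragment; the trailing interruptibility-state mask is filled in from the 3.16-era vmx.c and is an assumption, since the quote is cut off at that point.

```c
/*
 * Bandan's proposed change written out as a whole function for
 * readability.  The added condition is !is_guest_mode(vcpu); the final
 * GUEST_INTR_STATE_* mask is reconstructed (assumed) because the quoted
 * diff above is truncated there.
 */
static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
{
	return (!is_guest_mode(vcpu) &&
		!to_vmx(vcpu)->nested.nested_run_pending &&
		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
		  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}
```

As the replies below note, this keeps L1's interrupts from being injected while the vcpu is in guest mode, but Wanpeng raises two cases it does not cover: an L1 that does not intercept external interrupts, and L0 requesting an interrupt window for L2 for an interrupt that belongs to L1.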
Hi Jan, On Wed, Jul 02, 2014 at 11:01:30AM +0200, Jan Kiszka wrote: >On 2014-07-02 08:54, Wanpeng Li wrote: >> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >> >> If we didn't inject a still-pending event to L1 since nested_run_pending, >> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >> there is no still-pending event to L1 which blocked by nested_run_pending. >> There is a race which lead to an interrupt will be injected to L2 which >> belong to L1 if L0 send an interrupt to L1 during this window. >> >> VCPU0 another thread >> >> L1 intr not blocked on L2 first entry >> vmx_vcpu_run req event >> kvm check request req event >> check_nested_events don't have any intr >> not nested exit >> intr occur (8254, lapic timer etc) >> inject_pending_event now have intr >> inject interrupt >> >> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >> which indicates there is still-pending event which blocked by nested_run_pending, >> and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked >> by nested_run_pending. > >There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why >aren't those able to trigger this scenario? > Other KVM_REQ_EVENT has a specific intr or pending , however, the KVM_REQ_EVENT which request after vmexit with nested_run_pending may or may not have a specific intr or pending. Regards, Wanpeng Li >In any case, unconditionally setting KVM_REQ_EVENT seems strange and >should be changed. > >Jan > >> >> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> >> --- >> arch/x86/kvm/vmx.c | 20 +++++++++++++++----- >> 1 file changed, 15 insertions(+), 5 deletions(-) >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> index f4e5aed..fe69c49 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -372,6 +372,7 @@ struct nested_vmx { >> u64 vmcs01_tsc_offset; >> /* L2 must run next, and mustn't decide to exit to L1. */ >> bool nested_run_pending; >> + bool l1_events_blocked; >> /* >> * Guest pages referred to in vmcs02 with host-physical pointers, so >> * we must keep them pinned while L2 runs. >> @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) >> * we did not inject a still-pending event to L1 now because of >> * nested_run_pending, we need to re-enable this bit. 
>> */ >> - if (vmx->nested.nested_run_pending) >> + if (to_vmx(vcpu)->nested.l1_events_blocked) { >> + to_vmx(vcpu)->nested.l1_events_blocked = false; >> kvm_make_request(KVM_REQ_EVENT, vcpu); >> + } >> >> vmx->nested.nested_run_pending = 0; >> >> @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) >> >> if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && >> vmx->nested.preemption_timer_expired) { >> - if (vmx->nested.nested_run_pending) >> + if (vmx->nested.nested_run_pending) { >> + vmx->nested.l1_events_blocked = true; >> return -EBUSY; >> + } >> nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); >> return 0; >> } >> >> if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) { >> - if (vmx->nested.nested_run_pending || >> - vcpu->arch.interrupt.pending) >> + if (vmx->nested.nested_run_pending) { >> + vmx->nested.l1_events_blocked = true; >> + return -EBUSY; >> + } >> + if (vcpu->arch.interrupt.pending) >> return -EBUSY; >> nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, >> NMI_VECTOR | INTR_TYPE_NMI_INTR | >> @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) >> >> if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && >> nested_exit_on_intr(vcpu)) { >> - if (vmx->nested.nested_run_pending) >> + if (vmx->nested.nested_run_pending) { >> + vmx->nested.l1_events_blocked = true; >> return -EBUSY; >> + } >> nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); >> } >> >> > >-- >Siemens AG, Corporate Technology, CT RTC ITP SES-DE >Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Bandan, On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote: >Wanpeng Li <wanpeng.li@linux.intel.com> writes: > >> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >I can also reproduce this easily with Linux as L1 by "slowing it down" >eg. running with ept = 0 > >I suggest changing the subject to - >KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 > Ok, I will fold this to next version. ;-) >> If we didn't inject a still-pending event to L1 since nested_run_pending, >> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >> event to L1. However, current log blindly request a KVM_REQ_EVENT even if > >What's current "log" ? Do you mean current "code" ? > Yeah, it's a typo. I mean "logic". [...] >Also, I am wondering isn't it enough to just do this to avoid this race ? > > static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) > { >- return (!to_vmx(vcpu)->nested.nested_run_pending && >+ return (!is_guest_mode(vcpu) && >+ !to_vmx(vcpu)->nested.nested_run_pending && > vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && > !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & > I don't think you fix the root cause of the race, and there are two cases which I concern about your proposal: - If there is a special L1 which don't ask to exit on external intrs, you will lose the intrs which L0 inject to L2. - If inject_pending_event fail to inject an intr since the change that you made in vmx_interrupt_allowed, L0 will request an intr window for L2, however, the intr is belong to L1. Regards, Wanpeng Li >Thanks, >Bandan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Kiszka <jan.kiszka@siemens.com> writes: > On 2014-07-02 08:54, Wanpeng Li wrote: >> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >> >> If we didn't inject a still-pending event to L1 since nested_run_pending, >> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >> there is no still-pending event to L1 which blocked by nested_run_pending. >> There is a race which lead to an interrupt will be injected to L2 which >> belong to L1 if L0 send an interrupt to L1 during this window. >> >> VCPU0 another thread >> >> L1 intr not blocked on L2 first entry >> vmx_vcpu_run req event >> kvm check request req event >> check_nested_events don't have any intr >> not nested exit >> intr occur (8254, lapic timer etc) >> inject_pending_event now have intr >> inject interrupt >> >> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >> which indicates there is still-pending event which blocked by nested_run_pending, >> and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked >> by nested_run_pending. > > There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why > aren't those able to trigger this scenario? > > In any case, unconditionally setting KVM_REQ_EVENT seems strange and > should be changed. Ugh! I think I am hitting another one but this one's probably because we are not setting KVM_REQ_EVENT for something we should. Before this patch, I was able to hit this bug everytime with "modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0" and then booting L2. I can verify that I was indeed hitting the race in inject_pending_event. After this patch, I believe I am hitting another bug - this happens after I boot L2, as above, and then start a Linux kernel compilation and then wait and watch :) It's a pain to debug because this happens almost once in three times; it never happens if I run with ept=1, however, I think that's only because the test completes sooner. But I can confirm that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of the approach this patch takes. (Any debug hints help appreciated!) So, I am not sure if this is the right fix. Rather, I think the safer thing to do is to have the interrupt pending check for injection into L1 at the "same site" as the call to kvm_queue_interrupt() just like we had before commit b6b8a1451fc40412c57d1. Is there any advantage to having all the nested events checks together ? PS - Actually, a much easier fix (or rather hack) is to return 1 in vmx_interrupt_allowed() (as I mentioned elsewhere) only if !is_guest_mode(vcpu) That way, the pending interrupt interrupt can be taken care of correctly during the next vmexit. Bandan > Jan > >> >> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> >> --- >> arch/x86/kvm/vmx.c | 20 +++++++++++++++----- >> 1 file changed, 15 insertions(+), 5 deletions(-) >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> index f4e5aed..fe69c49 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -372,6 +372,7 @@ struct nested_vmx { >> u64 vmcs01_tsc_offset; >> /* L2 must run next, and mustn't decide to exit to L1. */ >> bool nested_run_pending; >> + bool l1_events_blocked; >> /* >> * Guest pages referred to in vmcs02 with host-physical pointers, so >> * we must keep them pinned while L2 runs. 
>> @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) >> * we did not inject a still-pending event to L1 now because of >> * nested_run_pending, we need to re-enable this bit. >> */ >> - if (vmx->nested.nested_run_pending) >> + if (to_vmx(vcpu)->nested.l1_events_blocked) { >> + to_vmx(vcpu)->nested.l1_events_blocked = false; >> kvm_make_request(KVM_REQ_EVENT, vcpu); >> + } >> >> vmx->nested.nested_run_pending = 0; >> >> @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) >> >> if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && >> vmx->nested.preemption_timer_expired) { >> - if (vmx->nested.nested_run_pending) >> + if (vmx->nested.nested_run_pending) { >> + vmx->nested.l1_events_blocked = true; >> return -EBUSY; >> + } >> nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); >> return 0; >> } >> >> if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) { >> - if (vmx->nested.nested_run_pending || >> - vcpu->arch.interrupt.pending) >> + if (vmx->nested.nested_run_pending) { >> + vmx->nested.l1_events_blocked = true; >> + return -EBUSY; >> + } >> + if (vcpu->arch.interrupt.pending) >> return -EBUSY; >> nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, >> NMI_VECTOR | INTR_TYPE_NMI_INTR | >> @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) >> >> if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && >> nested_exit_on_intr(vcpu)) { >> - if (vmx->nested.nested_run_pending) >> + if (vmx->nested.nested_run_pending) { >> + vmx->nested.l1_events_blocked = true; >> return -EBUSY; >> + } >> nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); >> } >> >> -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Wanpeng Li <wanpeng.li@linux.intel.com> writes: > Hi Bandan, > On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote: >>Wanpeng Li <wanpeng.li@linux.intel.com> writes: >> >>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>I can also reproduce this easily with Linux as L1 by "slowing it down" >>eg. running with ept = 0 >> >>I suggest changing the subject to - >>KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 >> > > Ok, I will fold this to next version. ;-) > >>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >> >>What's current "log" ? Do you mean current "code" ? >> > > Yeah, it's a typo. I mean "logic". > > [...] >>Also, I am wondering isn't it enough to just do this to avoid this race ? >> >> static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) >> { >>- return (!to_vmx(vcpu)->nested.nested_run_pending && >>+ return (!is_guest_mode(vcpu) && >>+ !to_vmx(vcpu)->nested.nested_run_pending && >> vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && >> !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & >> > > I don't think you fix the root cause of the race, and there are two cases which > I concern about your proposal: > > - If there is a special L1 which don't ask to exit on external intrs, you will > lose the intrs which L0 inject to L2. Oh didn't think about that case :), thanks for the pointing this out. It's easy to check this with Xen as L1, I suppose. > - If inject_pending_event fail to inject an intr since the change that you made > in vmx_interrupt_allowed, L0 will request an intr window for L2, however, the > intr is belong to L1. Oh, I thought inject_pending_event has kvm_queue_interrupt only if vmx_interrupt_allowed() returns true so, interrupt will be injected correctly on the next vmexit. Anyway, I am hitting another bug with your patch! Please see my other mail to the list. Thanks! > Regards, > Wanpeng Li > >>Thanks, >>Bandan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 03, 2014 at 01:15:26AM -0400, Bandan Das wrote: >Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 2014-07-02 08:54, Wanpeng Li wrote: >>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>> >>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >>> there is no still-pending event to L1 which blocked by nested_run_pending. >>> There is a race which lead to an interrupt will be injected to L2 which >>> belong to L1 if L0 send an interrupt to L1 during this window. >>> >>> VCPU0 another thread >>> >>> L1 intr not blocked on L2 first entry >>> vmx_vcpu_run req event >>> kvm check request req event >>> check_nested_events don't have any intr >>> not nested exit >>> intr occur (8254, lapic timer etc) >>> inject_pending_event now have intr >>> inject interrupt >>> >>> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >>> which indicates there is still-pending event which blocked by nested_run_pending, >>> and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked >>> by nested_run_pending. >> >> There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why >> aren't those able to trigger this scenario? >> >> In any case, unconditionally setting KVM_REQ_EVENT seems strange and >> should be changed. > > >Ugh! I think I am hitting another one but this one's probably because >we are not setting KVM_REQ_EVENT for something we should. > >Before this patch, I was able to hit this bug everytime with >"modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0" and then booting >L2. I can verify that I was indeed hitting the race in inject_pending_event. > >After this patch, I believe I am hitting another bug - this happens >after I boot L2, as above, and then start a Linux kernel compilation >and then wait and watch :) It's a pain to debug because this happens I have already try several times with "modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0" and still can't reproduce the bug you mentioned. Could you give me more details such like L1 and L2 which one hang or panic? In addition, if you can post the call trace is appreciated. Regards, Wanpeng Li >almost once in three times; it never happens if I run with ept=1, however, >I think that's only because the test completes sooner. But I can confirm >that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of >the approach this patch takes. >(Any debug hints help appreciated!) > >So, I am not sure if this is the right fix. Rather, I think the safer thing >to do is to have the interrupt pending check for injection into L1 at >the "same site" as the call to kvm_queue_interrupt() just like we had before >commit b6b8a1451fc40412c57d1. Is there any advantage to having all the >nested events checks together ? > >PS - Actually, a much easier fix (or rather hack) is to return 1 in >vmx_interrupt_allowed() (as I mentioned elsewhere) only if >!is_guest_mode(vcpu) That way, the pending interrupt interrupt >can be taken care of correctly during the next vmexit. > >Bandan > >> Jan >> [...] -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2014-07-03 07:29, Bandan Das wrote: > Wanpeng Li <wanpeng.li@linux.intel.com> writes: > >> Hi Bandan, >> On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote: >>> Wanpeng Li <wanpeng.li@linux.intel.com> writes: >>> >>>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>> I can also reproduce this easily with Linux as L1 by "slowing it down" >>> eg. running with ept = 0 >>> >>> I suggest changing the subject to - >>> KVM: nVMX: Fix race that incorrectly injects L1's irq to L2 >>> >> >> Ok, I will fold this to next version. ;-) >> >>>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >>> >>> What's current "log" ? Do you mean current "code" ? >>> >> >> Yeah, it's a typo. I mean "logic". >> >> [...] >>> Also, I am wondering isn't it enough to just do this to avoid this race ? >>> >>> static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu) >>> { >>> - return (!to_vmx(vcpu)->nested.nested_run_pending && >>> + return (!is_guest_mode(vcpu) && >>> + !to_vmx(vcpu)->nested.nested_run_pending && >>> vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && >>> !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & >>> >> >> I don't think you fix the root cause of the race, and there are two cases which >> I concern about your proposal: >> >> - If there is a special L1 which don't ask to exit on external intrs, you will >> lose the intrs which L0 inject to L2. > > Oh didn't think about that case :), thanks for the pointing this out. > It's easy to check this with Xen as L1, I suppose. Xen most probably intercepts external interrupts, but Jailhouse definitely does not. We also have a unit test for that, but I will likely not expose the issue of lost events. Jan
Wanpeng Li <wanpeng.li@linux.intel.com> writes: > On Thu, Jul 03, 2014 at 01:15:26AM -0400, Bandan Das wrote: >>Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 2014-07-02 08:54, Wanpeng Li wrote: >>>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>>> >>>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >>>> there is no still-pending event to L1 which blocked by nested_run_pending. >>>> There is a race which lead to an interrupt will be injected to L2 which >>>> belong to L1 if L0 send an interrupt to L1 during this window. >>>> >>>> VCPU0 another thread >>>> >>>> L1 intr not blocked on L2 first entry >>>> vmx_vcpu_run req event >>>> kvm check request req event >>>> check_nested_events don't have any intr >>>> not nested exit >>>> intr occur (8254, lapic timer etc) >>>> inject_pending_event now have intr >>>> inject interrupt >>>> >>>> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >>>> which indicates there is still-pending event which blocked by nested_run_pending, >>>> and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked >>>> by nested_run_pending. >>> >>> There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why >>> aren't those able to trigger this scenario? >>> >>> In any case, unconditionally setting KVM_REQ_EVENT seems strange and >>> should be changed. >> >> >>Ugh! I think I am hitting another one but this one's probably because >>we are not setting KVM_REQ_EVENT for something we should. >> >>Before this patch, I was able to hit this bug everytime with >>"modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0" and then booting >>L2. I can verify that I was indeed hitting the race in inject_pending_event. >> >>After this patch, I believe I am hitting another bug - this happens >>after I boot L2, as above, and then start a Linux kernel compilation >>and then wait and watch :) It's a pain to debug because this happens > > I have already try several times with "modprobe kvm_intel ept=0 nested=1 > enable_shadow_vmcs=0" and still can't reproduce the bug you mentioned. > Could you give me more details such like L1 and L2 which one hang or panic? > In addition, if you can post the call trace is appreciated. # modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 The Host CPU - Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz qemu cmd to run L1 - # qemu-system-x86_64 -drive file=level1.img,if=virtio,id=disk0,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -drive file=level2.img,if=virtio,id=disk1,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -vnc :2 --enable-kvm -monitor stdio -m 4G -net nic,macaddr=00:23:32:45:89:10 -net tap,ifname=tap0,script=/etc/qemu-ifup,downscript=no -smp 4 -cpu Nehalem,+vmx -serial pty qemu cmd to run L2 - # sudo qemu-system-x86_64 -hda VM/level2.img -vnc :0 --enable-kvm -monitor stdio -m 2G -smp 2 -cpu Nehalem -redir tcp:5555::22 Additionally, L0 is FC19 with 3.16-rc3 L1 and L2 are Ubuntu 14.04 with 3.13.0-24-generic Then start a kernel compilation inside L2 with "make -j3" There's no call trace on L0, both L0 and L1 are hung (or rather really slow) and L1 serial spews out CPU softlock up errors. 
Enabling panic on softlockup on L1 will give a trace with smp_call_function_many() I think the corresponding code in kernel/smp.c that triggers this is WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() && !oops_in_progress && !early_boot_irqs_disabled); I know in most cases this is usually harmless, but in this specific case, it seems it's stuck here forever. Sorry, I don't have a L1 call trace handy atm, I can post that if you are interested. Note that this can take as much as 30 to 40 minutes to appear but once it does, you will know because both L1 and L2 will be stuck with the serial messages as I mentioned before. From my side, let me try this on another system to rule out any machine specific weirdness going on.. Please let me know if you need any further information. Thanks Bandan > Regards, > Wanpeng Li > >>almost once in three times; it never happens if I run with ept=1, however, >>I think that's only because the test completes sooner. But I can confirm >>that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of >>the approach this patch takes. >>(Any debug hints help appreciated!) >> >>So, I am not sure if this is the right fix. Rather, I think the safer thing >>to do is to have the interrupt pending check for injection into L1 at >>the "same site" as the call to kvm_queue_interrupt() just like we had before >>commit b6b8a1451fc40412c57d1. Is there any advantage to having all the >>nested events checks together ? >> >>PS - Actually, a much easier fix (or rather hack) is to return 1 in >>vmx_interrupt_allowed() (as I mentioned elsewhere) only if >>!is_guest_mode(vcpu) That way, the pending interrupt interrupt >>can be taken care of correctly during the next vmexit. >> >>Bandan >> >>> Jan >>> > [...] -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 03, 2014 at 01:27:05PM -0400, Bandan Das wrote: [...] ># modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 > >The Host CPU - Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz >qemu cmd to run L1 - ># qemu-system-x86_64 -drive file=level1.img,if=virtio,id=disk0,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -drive file=level2.img,if=virtio,id=disk1,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -vnc :2 --enable-kvm -monitor stdio -m 4G -net nic,macaddr=00:23:32:45:89:10 -net tap,ifname=tap0,script=/etc/qemu-ifup,downscript=no -smp 4 -cpu Nehalem,+vmx -serial pty > >qemu cmd to run L2 - ># sudo qemu-system-x86_64 -hda VM/level2.img -vnc :0 --enable-kvm -monitor stdio -m 2G -smp 2 -cpu Nehalem -redir tcp:5555::22 > >Additionally, >L0 is FC19 with 3.16-rc3 >L1 and L2 are Ubuntu 14.04 with 3.13.0-24-generic > >Then start a kernel compilation inside L2 with "make -j3" > >There's no call trace on L0, both L0 and L1 are hung (or rather really slow) and >L1 serial spews out CPU softlock up errors. Enabling panic on softlockup on L1 will give >a trace with smp_call_function_many() I think the corresponding code in kernel/smp.c that >triggers this is > >WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() > && !oops_in_progress && !early_boot_irqs_disabled); > >I know in most cases this is usually harmless, but in this specific case, >it seems it's stuck here forever. > >Sorry, I don't have a L1 call trace handy atm, I can post that if you are interested. > >Note that this can take as much as 30 to 40 minutes to appear but once it does, >you will know because both L1 and L2 will be stuck with the serial messages as I mentioned >before. From my side, let me try this on another system to rule out any machine specific >weirdness going on.. > Thanks for your pointing out. >Please let me know if you need any further information. > I just run kvm-unit-tests w/ vmx.flat and eventinj.flat. w/ vmx.flat and w/o my patch applied [...] Test suite : interrupt FAIL: direct interrupt while running guest PASS: intercepted interrupt while running guest FAIL: direct interrupt + hlt FAIL: intercepted interrupt + hlt FAIL: direct interrupt + activity state hlt FAIL: intercepted interrupt + activity state hlt PASS: running a guest with interrupt acknowledgement set SUMMARY: 69 tests, 6 failures w/ vmx.flat and w/ my patch applied [...] Test suite : interrupt PASS: direct interrupt while running guest PASS: intercepted interrupt while running guest PASS: direct interrupt + hlt FAIL: intercepted interrupt + hlt PASS: direct interrupt + activity state hlt PASS: intercepted interrupt + activity state hlt PASS: running a guest with interrupt acknowledgement set SUMMARY: 69 tests, 2 failures w/ eventinj.flat and w/o my patch applied SUMMARY: 13 tests, 0 failures w/ eventinj.flat and w/ my patch applied SUMMARY: 13 tests, 0 failures I'm not sure if the bug you mentioned has any relationship with "Fail: intercepted interrupt + hlt" which has already present before my patch. Regards, Wanpeng Li >Thanks >Bandan > >> Regards, >> Wanpeng Li >> >>>almost once in three times; it never happens if I run with ept=1, however, >>>I think that's only because the test completes sooner. But I can confirm >>>that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of >>>the approach this patch takes. >>>(Any debug hints help appreciated!) >>> >>>So, I am not sure if this is the right fix. 
Rather, I think the safer thing >>>to do is to have the interrupt pending check for injection into L1 at >>>the "same site" as the call to kvm_queue_interrupt() just like we had before >>>commit b6b8a1451fc40412c57d1. Is there any advantage to having all the >>>nested events checks together ? >>> >>>PS - Actually, a much easier fix (or rather hack) is to return 1 in >>>vmx_interrupt_allowed() (as I mentioned elsewhere) only if >>>!is_guest_mode(vcpu) That way, the pending interrupt interrupt >>>can be taken care of correctly during the next vmexit. >>> >>>Bandan >>> >>>> Jan >>>> >> [...] -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2014-07-04 04:52, Wanpeng Li wrote: > On Thu, Jul 03, 2014 at 01:27:05PM -0400, Bandan Das wrote: > [...] >> # modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 >> >> The Host CPU - Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz >> qemu cmd to run L1 - >> # qemu-system-x86_64 -drive file=level1.img,if=virtio,id=disk0,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -drive file=level2.img,if=virtio,id=disk1,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -vnc :2 --enable-kvm -monitor stdio -m 4G -net nic,macaddr=00:23:32:45:89:10 -net tap,ifname=tap0,script=/etc/qemu-ifup,downscript=no -smp 4 -cpu Nehalem,+vmx -serial pty >> >> qemu cmd to run L2 - >> # sudo qemu-system-x86_64 -hda VM/level2.img -vnc :0 --enable-kvm -monitor stdio -m 2G -smp 2 -cpu Nehalem -redir tcp:5555::22 >> >> Additionally, >> L0 is FC19 with 3.16-rc3 >> L1 and L2 are Ubuntu 14.04 with 3.13.0-24-generic >> >> Then start a kernel compilation inside L2 with "make -j3" >> >> There's no call trace on L0, both L0 and L1 are hung (or rather really slow) and >> L1 serial spews out CPU softlock up errors. Enabling panic on softlockup on L1 will give >> a trace with smp_call_function_many() I think the corresponding code in kernel/smp.c that >> triggers this is >> >> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() >> && !oops_in_progress && !early_boot_irqs_disabled); >> >> I know in most cases this is usually harmless, but in this specific case, >> it seems it's stuck here forever. >> >> Sorry, I don't have a L1 call trace handy atm, I can post that if you are interested. >> >> Note that this can take as much as 30 to 40 minutes to appear but once it does, >> you will know because both L1 and L2 will be stuck with the serial messages as I mentioned >> before. From my side, let me try this on another system to rule out any machine specific >> weirdness going on.. >> > > Thanks for your pointing out. > >> Please let me know if you need any further information. >> > > I just run kvm-unit-tests w/ vmx.flat and eventinj.flat. > > > w/ vmx.flat and w/o my patch applied > > [...] > > Test suite : interrupt > FAIL: direct interrupt while running guest > PASS: intercepted interrupt while running guest > FAIL: direct interrupt + hlt > FAIL: intercepted interrupt + hlt > FAIL: direct interrupt + activity state hlt > FAIL: intercepted interrupt + activity state hlt > PASS: running a guest with interrupt acknowledgement set > SUMMARY: 69 tests, 6 failures > > w/ vmx.flat and w/ my patch applied > > [...] > > Test suite : interrupt > PASS: direct interrupt while running guest > PASS: intercepted interrupt while running guest > PASS: direct interrupt + hlt > FAIL: intercepted interrupt + hlt > PASS: direct interrupt + activity state hlt > PASS: intercepted interrupt + activity state hlt > PASS: running a guest with interrupt acknowledgement set > > SUMMARY: 69 tests, 2 failures Which version (hash) of kvm-unit-tests are you using? All tests up to 307621765a are running fine here, but since a0e30e712d not much is completing successfully anymore: enabling apic paging enabled cr0 = 80010011 cr3 = 7fff000 cr4 = 20 PASS: test vmxon with FEATURE_CONTROL cleared PASS: test vmxon without FEATURE_CONTROL lock PASS: test enable VMX in FEATURE_CONTROL PASS: test FEATURE_CONTROL lock bit PASS: test vmxon FAIL: test vmptrld PASS: test vmclear init_vmcs : make_vmcs_current error FAIL: test vmptrst init_vmcs : make_vmcs_current error vmx_run : vmlaunch failed. 
FAIL: test vmlaunch FAIL: test vmlaunch SUMMARY: 10 tests, 4 unexpected failures Jan
On Fri, Jul 04, 2014 at 07:43:14AM +0200, Jan Kiszka wrote: >On 2014-07-04 04:52, Wanpeng Li wrote: >> On Thu, Jul 03, 2014 at 01:27:05PM -0400, Bandan Das wrote: >> [...] >>> # modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 >>> >>> The Host CPU - Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz >>> qemu cmd to run L1 - >>> # qemu-system-x86_64 -drive file=level1.img,if=virtio,id=disk0,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -drive file=level2.img,if=virtio,id=disk1,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -vnc :2 --enable-kvm -monitor stdio -m 4G -net nic,macaddr=00:23:32:45:89:10 -net tap,ifname=tap0,script=/etc/qemu-ifup,downscript=no -smp 4 -cpu Nehalem,+vmx -serial pty >>> >>> qemu cmd to run L2 - >>> # sudo qemu-system-x86_64 -hda VM/level2.img -vnc :0 --enable-kvm -monitor stdio -m 2G -smp 2 -cpu Nehalem -redir tcp:5555::22 >>> >>> Additionally, >>> L0 is FC19 with 3.16-rc3 >>> L1 and L2 are Ubuntu 14.04 with 3.13.0-24-generic >>> >>> Then start a kernel compilation inside L2 with "make -j3" >>> >>> There's no call trace on L0, both L0 and L1 are hung (or rather really slow) and >>> L1 serial spews out CPU softlock up errors. Enabling panic on softlockup on L1 will give >>> a trace with smp_call_function_many() I think the corresponding code in kernel/smp.c that >>> triggers this is >>> >>> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() >>> && !oops_in_progress && !early_boot_irqs_disabled); >>> >>> I know in most cases this is usually harmless, but in this specific case, >>> it seems it's stuck here forever. >>> >>> Sorry, I don't have a L1 call trace handy atm, I can post that if you are interested. >>> >>> Note that this can take as much as 30 to 40 minutes to appear but once it does, >>> you will know because both L1 and L2 will be stuck with the serial messages as I mentioned >>> before. From my side, let me try this on another system to rule out any machine specific >>> weirdness going on.. >>> >> >> Thanks for your pointing out. >> >>> Please let me know if you need any further information. >>> >> >> I just run kvm-unit-tests w/ vmx.flat and eventinj.flat. >> >> >> w/ vmx.flat and w/o my patch applied >> >> [...] >> >> Test suite : interrupt >> FAIL: direct interrupt while running guest >> PASS: intercepted interrupt while running guest >> FAIL: direct interrupt + hlt >> FAIL: intercepted interrupt + hlt >> FAIL: direct interrupt + activity state hlt >> FAIL: intercepted interrupt + activity state hlt >> PASS: running a guest with interrupt acknowledgement set >> SUMMARY: 69 tests, 6 failures >> >> w/ vmx.flat and w/ my patch applied >> >> [...] >> >> Test suite : interrupt >> PASS: direct interrupt while running guest >> PASS: intercepted interrupt while running guest >> PASS: direct interrupt + hlt >> FAIL: intercepted interrupt + hlt >> PASS: direct interrupt + activity state hlt >> PASS: intercepted interrupt + activity state hlt >> PASS: running a guest with interrupt acknowledgement set >> >> SUMMARY: 69 tests, 2 failures > >Which version (hash) of kvm-unit-tests are you using? All tests up to >307621765a are running fine here, but since a0e30e712d not much is >completing successfully anymore: > I just git pull my kvm-unit-tests to latest, the last commit is daeec9795d. 
>enabling apic >paging enabled >cr0 = 80010011 >cr3 = 7fff000 >cr4 = 20 >PASS: test vmxon with FEATURE_CONTROL cleared >PASS: test vmxon without FEATURE_CONTROL lock >PASS: test enable VMX in FEATURE_CONTROL >PASS: test FEATURE_CONTROL lock bit >PASS: test vmxon >FAIL: test vmptrld >PASS: test vmclear >init_vmcs : make_vmcs_current error >FAIL: test vmptrst >init_vmcs : make_vmcs_current error >vmx_run : vmlaunch failed. >FAIL: test vmlaunch >FAIL: test vmlaunch > >SUMMARY: 10 tests, 4 unexpected failures /opt/qemu/bin/qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio -device isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/vmx.flat -cpu host Test suite : interrupt PASS: direct interrupt while running guest PASS: intercepted interrupt while running guest PASS: direct interrupt + hlt FAIL: intercepted interrupt + hlt PASS: direct interrupt + activity state hlt PASS: intercepted interrupt + activity state hlt PASS: running a guest with interrupt acknowledgement set SUMMARY: 69 tests, 2 failures However, interrupted interrupt + hlt always fail w/ and w/o my patch. Regards, Wanpeng Li > > >Jan > >-- >Siemens AG, Corporate Technology, CT RTC ITP SES-DE >Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 03, 2014 at 01:15:26AM -0400, Bandan Das wrote: >Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 2014-07-02 08:54, Wanpeng Li wrote: >>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>> >>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >>> there is no still-pending event to L1 which blocked by nested_run_pending. >>> There is a race which lead to an interrupt will be injected to L2 which >>> belong to L1 if L0 send an interrupt to L1 during this window. >>> >>> VCPU0 another thread >>> >>> L1 intr not blocked on L2 first entry >>> vmx_vcpu_run req event >>> kvm check request req event >>> check_nested_events don't have any intr >>> not nested exit >>> intr occur (8254, lapic timer etc) >>> inject_pending_event now have intr >>> inject interrupt >>> >>> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >>> which indicates there is still-pending event which blocked by nested_run_pending, >>> and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked >>> by nested_run_pending. >> >> There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why >> aren't those able to trigger this scenario? >> >> In any case, unconditionally setting KVM_REQ_EVENT seems strange and >> should be changed. > > >Ugh! I think I am hitting another one but this one's probably because >we are not setting KVM_REQ_EVENT for something we should. > >Before this patch, I was able to hit this bug everytime with >"modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0" and then booting >L2. I can verify that I was indeed hitting the race in inject_pending_event. > >After this patch, I believe I am hitting another bug - this happens >after I boot L2, as above, and then start a Linux kernel compilation >and then wait and watch :) It's a pain to debug because this happens >almost once in three times; it never happens if I run with ept=1, however, >I think that's only because the test completes sooner. But I can confirm >that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of >the approach this patch takes. >(Any debug hints help appreciated!) > >So, I am not sure if this is the right fix. Rather, I think the safer thing >to do is to have the interrupt pending check for injection into L1 at >the "same site" as the call to kvm_queue_interrupt() just like we had before >commit b6b8a1451fc40412c57d1. Is there any advantage to having all the >nested events checks together ? > How about revert commit b6b8a1451 and try if the bug which you mentioned is still there? Regards, Wanpeng Li >PS - Actually, a much easier fix (or rather hack) is to return 1 in >vmx_interrupt_allowed() (as I mentioned elsewhere) only if >!is_guest_mode(vcpu) That way, the pending interrupt interrupt >can be taken care of correctly during the next vmexit. > >Bandan > >> Jan >> >>> >>> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> >>> --- >>> arch/x86/kvm/vmx.c | 20 +++++++++++++++----- >>> 1 file changed, 15 insertions(+), 5 deletions(-) >>> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>> index f4e5aed..fe69c49 100644 >>> --- a/arch/x86/kvm/vmx.c >>> +++ b/arch/x86/kvm/vmx.c >>> @@ -372,6 +372,7 @@ struct nested_vmx { >>> u64 vmcs01_tsc_offset; >>> /* L2 must run next, and mustn't decide to exit to L1. 
*/ >>> bool nested_run_pending; >>> + bool l1_events_blocked; >>> /* >>> * Guest pages referred to in vmcs02 with host-physical pointers, so >>> * we must keep them pinned while L2 runs. >>> @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) >>> * we did not inject a still-pending event to L1 now because of >>> * nested_run_pending, we need to re-enable this bit. >>> */ >>> - if (vmx->nested.nested_run_pending) >>> + if (to_vmx(vcpu)->nested.l1_events_blocked) { >>> + to_vmx(vcpu)->nested.l1_events_blocked = false; >>> kvm_make_request(KVM_REQ_EVENT, vcpu); >>> + } >>> >>> vmx->nested.nested_run_pending = 0; >>> >>> @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) >>> >>> if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && >>> vmx->nested.preemption_timer_expired) { >>> - if (vmx->nested.nested_run_pending) >>> + if (vmx->nested.nested_run_pending) { >>> + vmx->nested.l1_events_blocked = true; >>> return -EBUSY; >>> + } >>> nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); >>> return 0; >>> } >>> >>> if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) { >>> - if (vmx->nested.nested_run_pending || >>> - vcpu->arch.interrupt.pending) >>> + if (vmx->nested.nested_run_pending) { >>> + vmx->nested.l1_events_blocked = true; >>> + return -EBUSY; >>> + } >>> + if (vcpu->arch.interrupt.pending) >>> return -EBUSY; >>> nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, >>> NMI_VECTOR | INTR_TYPE_NMI_INTR | >>> @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) >>> >>> if ((kvm_cpu_has_interrupt(vcpu) || external_intr) && >>> nested_exit_on_intr(vcpu)) { >>> - if (vmx->nested.nested_run_pending) >>> + if (vmx->nested.nested_run_pending) { >>> + vmx->nested.l1_events_blocked = true; >>> return -EBUSY; >>> + } >>> nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); >>> } >>> >>> -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
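For reference, the workaround Bandan sketches above ("return 1 in vmx_interrupt_allowed() ... only if !is_guest_mode(vcpu)") would look roughly like the helper below against the 3.16-era vmx.c. This is a hedged reconstruction, not a reviewed patch: the non-nested body is paraphrased from that code base and may differ in detail, and the is_guest_mode() early-return is only one reading of the suggestion.

/* Sketch of the workaround: never report "interrupt allowed" while L2 is
 * running, so an L1-bound interrupt is handled at the next nested vmexit
 * instead of being injected into L2. */
static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
{
	if (is_guest_mode(vcpu))
		return 0;	/* defer until we are back in L1 context */

	return (!to_vmx(vcpu)->nested.nested_run_pending &&
		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
		  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}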
On 2014-07-04 08:08, Wanpeng Li wrote: > On Fri, Jul 04, 2014 at 07:43:14AM +0200, Jan Kiszka wrote: >> On 2014-07-04 04:52, Wanpeng Li wrote: >>> On Thu, Jul 03, 2014 at 01:27:05PM -0400, Bandan Das wrote: >>> [...] >>>> # modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 >>>> >>>> The Host CPU - Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz >>>> qemu cmd to run L1 - >>>> # qemu-system-x86_64 -drive file=level1.img,if=virtio,id=disk0,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -drive file=level2.img,if=virtio,id=disk1,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -vnc :2 --enable-kvm -monitor stdio -m 4G -net nic,macaddr=00:23:32:45:89:10 -net tap,ifname=tap0,script=/etc/qemu-ifup,downscript=no -smp 4 -cpu Nehalem,+vmx -serial pty >>>> >>>> qemu cmd to run L2 - >>>> # sudo qemu-system-x86_64 -hda VM/level2.img -vnc :0 --enable-kvm -monitor stdio -m 2G -smp 2 -cpu Nehalem -redir tcp:5555::22 >>>> >>>> Additionally, >>>> L0 is FC19 with 3.16-rc3 >>>> L1 and L2 are Ubuntu 14.04 with 3.13.0-24-generic >>>> >>>> Then start a kernel compilation inside L2 with "make -j3" >>>> >>>> There's no call trace on L0, both L0 and L1 are hung (or rather really slow) and >>>> L1 serial spews out CPU softlock up errors. Enabling panic on softlockup on L1 will give >>>> a trace with smp_call_function_many() I think the corresponding code in kernel/smp.c that >>>> triggers this is >>>> >>>> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() >>>> && !oops_in_progress && !early_boot_irqs_disabled); >>>> >>>> I know in most cases this is usually harmless, but in this specific case, >>>> it seems it's stuck here forever. >>>> >>>> Sorry, I don't have a L1 call trace handy atm, I can post that if you are interested. >>>> >>>> Note that this can take as much as 30 to 40 minutes to appear but once it does, >>>> you will know because both L1 and L2 will be stuck with the serial messages as I mentioned >>>> before. From my side, let me try this on another system to rule out any machine specific >>>> weirdness going on.. >>>> >>> >>> Thanks for your pointing out. >>> >>>> Please let me know if you need any further information. >>>> >>> >>> I just run kvm-unit-tests w/ vmx.flat and eventinj.flat. >>> >>> >>> w/ vmx.flat and w/o my patch applied >>> >>> [...] >>> >>> Test suite : interrupt >>> FAIL: direct interrupt while running guest >>> PASS: intercepted interrupt while running guest >>> FAIL: direct interrupt + hlt >>> FAIL: intercepted interrupt + hlt >>> FAIL: direct interrupt + activity state hlt >>> FAIL: intercepted interrupt + activity state hlt >>> PASS: running a guest with interrupt acknowledgement set >>> SUMMARY: 69 tests, 6 failures >>> >>> w/ vmx.flat and w/ my patch applied >>> >>> [...] >>> >>> Test suite : interrupt >>> PASS: direct interrupt while running guest >>> PASS: intercepted interrupt while running guest >>> PASS: direct interrupt + hlt >>> FAIL: intercepted interrupt + hlt >>> PASS: direct interrupt + activity state hlt >>> PASS: intercepted interrupt + activity state hlt >>> PASS: running a guest with interrupt acknowledgement set >>> >>> SUMMARY: 69 tests, 2 failures >> >> Which version (hash) of kvm-unit-tests are you using? All tests up to >> 307621765a are running fine here, but since a0e30e712d not much is >> completing successfully anymore: >> > > I just git pull my kvm-unit-tests to latest, the last commit is daeec9795d. 
> >> enabling apic >> paging enabled >> cr0 = 80010011 >> cr3 = 7fff000 >> cr4 = 20 >> PASS: test vmxon with FEATURE_CONTROL cleared >> PASS: test vmxon without FEATURE_CONTROL lock >> PASS: test enable VMX in FEATURE_CONTROL >> PASS: test FEATURE_CONTROL lock bit >> PASS: test vmxon >> FAIL: test vmptrld >> PASS: test vmclear >> init_vmcs : make_vmcs_current error >> FAIL: test vmptrst >> init_vmcs : make_vmcs_current error >> vmx_run : vmlaunch failed. >> FAIL: test vmlaunch >> FAIL: test vmlaunch >> >> SUMMARY: 10 tests, 4 unexpected failures > > > /opt/qemu/bin/qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio > -device isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/vmx.flat -cpu host > > Test suite : interrupt > PASS: direct interrupt while running guest > PASS: intercepted interrupt while running guest > PASS: direct interrupt + hlt > FAIL: intercepted interrupt + hlt > PASS: direct interrupt + activity state hlt > PASS: intercepted interrupt + activity state hlt > PASS: running a guest with interrupt acknowledgement set > > SUMMARY: 69 tests, 2 failures Somehow I'm missing the other 31 vmx test we have now... Could you post the full log? Please also post the output of qemu/scripts/kvm/vmxcap on your test host to compare with what I have here. Thanks, Jan
On 2014-07-04 08:17, Wanpeng Li wrote: > On Thu, Jul 03, 2014 at 01:15:26AM -0400, Bandan Das wrote: >> Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 2014-07-02 08:54, Wanpeng Li wrote: >>>> This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 >>>> >>>> If we didn't inject a still-pending event to L1 since nested_run_pending, >>>> KVM_REQ_EVENT should be requested after the vmexit in order to inject the >>>> event to L1. However, current log blindly request a KVM_REQ_EVENT even if >>>> there is no still-pending event to L1 which blocked by nested_run_pending. >>>> There is a race which lead to an interrupt will be injected to L2 which >>>> belong to L1 if L0 send an interrupt to L1 during this window. >>>> >>>> VCPU0 another thread >>>> >>>> L1 intr not blocked on L2 first entry >>>> vmx_vcpu_run req event >>>> kvm check request req event >>>> check_nested_events don't have any intr >>>> not nested exit >>>> intr occur (8254, lapic timer etc) >>>> inject_pending_event now have intr >>>> inject interrupt >>>> >>>> This patch fix this race by introduced a l1_events_blocked field in nested_vmx >>>> which indicates there is still-pending event which blocked by nested_run_pending, >>>> and smart request a KVM_REQ_EVENT if there is a still-pending event which blocked >>>> by nested_run_pending. >>> >>> There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why >>> aren't those able to trigger this scenario? >>> >>> In any case, unconditionally setting KVM_REQ_EVENT seems strange and >>> should be changed. >> >> >> Ugh! I think I am hitting another one but this one's probably because >> we are not setting KVM_REQ_EVENT for something we should. >> >> Before this patch, I was able to hit this bug everytime with >> "modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0" and then booting >> L2. I can verify that I was indeed hitting the race in inject_pending_event. >> >> After this patch, I believe I am hitting another bug - this happens >> after I boot L2, as above, and then start a Linux kernel compilation >> and then wait and watch :) It's a pain to debug because this happens >> almost once in three times; it never happens if I run with ept=1, however, >> I think that's only because the test completes sooner. But I can confirm >> that I don't see it if I always set REQ_EVENT if nested_run_pending is set instead of >> the approach this patch takes. >> (Any debug hints help appreciated!) >> >> So, I am not sure if this is the right fix. Rather, I think the safer thing >> to do is to have the interrupt pending check for injection into L1 at >> the "same site" as the call to kvm_queue_interrupt() just like we had before >> commit b6b8a1451fc40412c57d1. Is there any advantage to having all the >> nested events checks together ? >> > > How about revert commit b6b8a1451 and try if the bug which you mentioned > is still there? I suspect you will have to reset back to b6b8a1451^ for this as other changes depend on this commit now. Jan
On Fri, Jul 04, 2014 at 09:19:54AM +0200, Jan Kiszka wrote: >On 2014-07-04 08:08, Wanpeng Li wrote: >> On Fri, Jul 04, 2014 at 07:43:14AM +0200, Jan Kiszka wrote: >>> On 2014-07-04 04:52, Wanpeng Li wrote: >>>> On Thu, Jul 03, 2014 at 01:27:05PM -0400, Bandan Das wrote: >>>> [...] >>>>> # modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 >>>>> >>>>> The Host CPU - Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz >>>>> qemu cmd to run L1 - >>>>> # qemu-system-x86_64 -drive file=level1.img,if=virtio,id=disk0,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -drive file=level2.img,if=virtio,id=disk1,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -vnc :2 --enable-kvm -monitor stdio -m 4G -net nic,macaddr=00:23:32:45:89:10 -net tap,ifname=tap0,script=/etc/qemu-ifup,downscript=no -smp 4 -cpu Nehalem,+vmx -serial pty >>>>> >>>>> qemu cmd to run L2 - >>>>> # sudo qemu-system-x86_64 -hda VM/level2.img -vnc :0 --enable-kvm -monitor stdio -m 2G -smp 2 -cpu Nehalem -redir tcp:5555::22 >>>>> >>>>> Additionally, >>>>> L0 is FC19 with 3.16-rc3 >>>>> L1 and L2 are Ubuntu 14.04 with 3.13.0-24-generic >>>>> >>>>> Then start a kernel compilation inside L2 with "make -j3" >>>>> >>>>> There's no call trace on L0, both L0 and L1 are hung (or rather really slow) and >>>>> L1 serial spews out CPU softlock up errors. Enabling panic on softlockup on L1 will give >>>>> a trace with smp_call_function_many() I think the corresponding code in kernel/smp.c that >>>>> triggers this is >>>>> >>>>> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() >>>>> && !oops_in_progress && !early_boot_irqs_disabled); >>>>> >>>>> I know in most cases this is usually harmless, but in this specific case, >>>>> it seems it's stuck here forever. >>>>> >>>>> Sorry, I don't have a L1 call trace handy atm, I can post that if you are interested. >>>>> >>>>> Note that this can take as much as 30 to 40 minutes to appear but once it does, >>>>> you will know because both L1 and L2 will be stuck with the serial messages as I mentioned >>>>> before. From my side, let me try this on another system to rule out any machine specific >>>>> weirdness going on.. >>>>> >>>> >>>> Thanks for your pointing out. >>>> >>>>> Please let me know if you need any further information. >>>>> >>>> >>>> I just run kvm-unit-tests w/ vmx.flat and eventinj.flat. >>>> >>>> >>>> w/ vmx.flat and w/o my patch applied >>>> >>>> [...] >>>> >>>> Test suite : interrupt >>>> FAIL: direct interrupt while running guest >>>> PASS: intercepted interrupt while running guest >>>> FAIL: direct interrupt + hlt >>>> FAIL: intercepted interrupt + hlt >>>> FAIL: direct interrupt + activity state hlt >>>> FAIL: intercepted interrupt + activity state hlt >>>> PASS: running a guest with interrupt acknowledgement set >>>> SUMMARY: 69 tests, 6 failures >>>> >>>> w/ vmx.flat and w/ my patch applied >>>> >>>> [...] >>>> >>>> Test suite : interrupt >>>> PASS: direct interrupt while running guest >>>> PASS: intercepted interrupt while running guest >>>> PASS: direct interrupt + hlt >>>> FAIL: intercepted interrupt + hlt >>>> PASS: direct interrupt + activity state hlt >>>> PASS: intercepted interrupt + activity state hlt >>>> PASS: running a guest with interrupt acknowledgement set >>>> >>>> SUMMARY: 69 tests, 2 failures >>> >>> Which version (hash) of kvm-unit-tests are you using? 
All tests up to >>> 307621765a are running fine here, but since a0e30e712d not much is >>> completing successfully anymore: >>> >> >> I just git pull my kvm-unit-tests to latest, the last commit is daeec9795d. >> >>> enabling apic >>> paging enabled >>> cr0 = 80010011 >>> cr3 = 7fff000 >>> cr4 = 20 >>> PASS: test vmxon with FEATURE_CONTROL cleared >>> PASS: test vmxon without FEATURE_CONTROL lock >>> PASS: test enable VMX in FEATURE_CONTROL >>> PASS: test FEATURE_CONTROL lock bit >>> PASS: test vmxon >>> FAIL: test vmptrld >>> PASS: test vmclear >>> init_vmcs : make_vmcs_current error >>> FAIL: test vmptrst >>> init_vmcs : make_vmcs_current error >>> vmx_run : vmlaunch failed. >>> FAIL: test vmlaunch >>> FAIL: test vmlaunch >>> >>> SUMMARY: 10 tests, 4 unexpected failures >> >> >> /opt/qemu/bin/qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio >> -device isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/vmx.flat -cpu host >> >> Test suite : interrupt >> PASS: direct interrupt while running guest >> PASS: intercepted interrupt while running guest >> PASS: direct interrupt + hlt >> FAIL: intercepted interrupt + hlt >> PASS: direct interrupt + activity state hlt >> PASS: intercepted interrupt + activity state hlt >> PASS: running a guest with interrupt acknowledgement set >> >> SUMMARY: 69 tests, 2 failures > >Somehow I'm missing the other 31 vmx test we have now... Could you post >the full log? Please also post the output of qemu/scripts/kvm/vmxcap on >your test host to compare with what I have here. They are in attachment. Regards, Wanpeng Li > >Thanks, >Jan > >-- >Siemens AG, Corporate Technology, CT RTC ITP SES-DE >Corporate Competence Center Embedded Linux Basic VMX Information Revision 18 VMCS size 1024 VMCS restricted to 32 bit addresses no Dual-monitor support yes VMCS memory type 6 INS/OUTS instruction information yes IA32_VMX_TRUE_*_CTLS support yes pin-based controls External interrupt exiting yes NMI exiting yes Virtual NMIs yes Activate VMX-preemption timer yes Process posted interrupts no primary processor-based controls Interrupt window exiting yes Use TSC offsetting yes HLT exiting yes INVLPG exiting yes MWAIT exiting yes RDPMC exiting yes RDTSC exiting yes CR3-load exiting default CR3-store exiting default CR8-load exiting yes CR8-store exiting yes Use TPR shadow yes NMI-window exiting yes MOV-DR exiting yes Unconditional I/O exiting yes Use I/O bitmaps yes Monitor trap flag yes Use MSR bitmaps yes MONITOR exiting yes PAUSE exiting yes Activate secondary control yes secondary processor-based controls Virtualize APIC accesses yes Enable EPT yes Descriptor-table exiting yes Enable RDTSCP yes Virtualize x2APIC mode yes Enable VPID yes WBINVD exiting yes Unrestricted guest yes APIC register emulation no Virtual interrupt delivery no PAUSE-loop exiting yes RDRAND exiting yes Enable INVPCID yes Enable VM functions yes VMCS shadowing yes EPT-violation #VE no VM-Exit controls Save debug controls default Host address-space size yes Load IA32_PERF_GLOBAL_CTRL yes Acknowledge interrupt on exit yes Save IA32_PAT yes Load IA32_PAT yes Save IA32_EFER yes Load IA32_EFER yes Save VMX-preemption timer value yes VM-Entry controls Load debug controls default IA-64 mode guest yes Entry to SMM yes Deactivate dual-monitor treatment yes Load IA32_PERF_GLOBAL_CTRL yes Load IA32_PAT yes Load IA32_EFER yes Miscellaneous data VMX-preemption timer scale (log2) 5 Store EFER.LMA into IA-32e mode guest control yes HLT activity state yes Shutdown activity state yes Wait-for-SIPI activity 
state yes IA32_SMBASE support yes Number of CR3-target values 4 MSR-load/store count recommenation 0 IA32_SMM_MONITOR_CTL[2] can be set to 1 yes VMWRITE to VM-exit information fields yes MSEG revision identifier 0 VPID and EPT capabilities Execute-only EPT translations yes Page-walk length 4 yes Paging-structure memory type UC yes Paging-structure memory type WB yes 2MB EPT pages yes 1GB EPT pages yes INVEPT supported yes EPT accessed and dirty flags yes Single-context INVEPT yes All-context INVEPT yes INVVPID supported yes Individual-address INVVPID yes Single-context INVVPID yes All-context INVVPID yes Single-context-retaining-globals INVVPID yes VM Functions EPTP Switching yes enabling apic paging enabled cr0 = 80010011 cr3 = 7fff000 cr4 = 20 PASS: test vmxon with FEATURE_CONTROL cleared PASS: test vmxon without FEATURE_CONTROL lock PASS: test enable VMX in FEATURE_CONTROL PASS: test FEATURE_CONTROL lock bit PASS: test vmxon PASS: test vmptrld PASS: test vmclear PASS: test vmptrst PASS: test vmxoff Test suite : vmenter PASS: test vmlaunch PASS: test vmresume Test suite : preemption timer PASS: Keep preemption value PASS: Save preemption value PASS: busy-wait for preemption timer PASS: preemption timer during hlt PASS: preemption timer with 0 value Test suite : control field PAT PASS: Exit save PAT PASS: Exit load PAT PASS: Entry load PAT Test suite : control field EFER PASS: Exit save EFER PASS: Exit load EFER PASS: Entry load EFER Test suite : CR shadowing PASS: Read through CR0 PASS: Read through CR4 PASS: Write through CR0 PASS: Write through CR4 PASS: Read shadowing CR0 PASS: Read shadowing CR4 PASS: Write shadowing CR0 (same value) PASS: Write shadowing CR4 (same value) PASS: Write shadowing different X86_CR0_TS PASS: Write shadowing different X86_CR0_MP PASS: Write shadowing different X86_CR4_TSD PASS: Write shadowing different X86_CR4_DE Test suite : I/O bitmap PASS: I/O bitmap - I/O pass PASS: I/O bitmap - I/O width, byte PASS: I/O bitmap - I/O direction, in PASS: I/O bitmap - trap in PASS: I/O bitmap - I/O width, word PASS: I/O bitmap - I/O direction, out PASS: I/O bitmap - trap out PASS: I/O bitmap - I/O width, long PASS: I/O bitmap - I/O port, low part PASS: I/O bitmap - I/O port, high part PASS: I/O bitmap - partial pass PASS: I/O bitmap - overrun PASS: I/O bitmap - ignore unconditional exiting PASS: I/O bitmap - unconditional exiting Test suite : instruction intercept PASS: HLT PASS: INVLPG PASS: MWAIT PASS: RDPMC PASS: RDTSC PASS: MONITOR PASS: PAUSE PASS: WBINVD PASS: CPUID PASS: INVD Test suite : EPT framework PASS: EPT basic framework PASS: EPT misconfigurations PASS: EPT violation - page permission FAIL: EPT violation - paging structure Test suite : interrupt PASS: direct interrupt while running guest PASS: intercepted interrupt while running guest PASS: direct interrupt + hlt FAIL: intercepted interrupt + hlt PASS: direct interrupt + activity state hlt PASS: intercepted interrupt + activity state hlt PASS: running a guest with interrupt acknowledgement set SUMMARY: 69 tests, 2 failures
Il 04/07/2014 07:43, Jan Kiszka ha scritto: > Which version (hash) of kvm-unit-tests are you using? All tests up to > 307621765a are running fine here, but since a0e30e712d not much is > completing successfully anymore: Which test failed to init and triggered the change in the patch? The patch was needed here to fix failures when KVM is loaded with ept=0. BTW, "FAIL: intercepted interrupt + hlt" is always failing here too (Xeon E5 Sandy Bridge). Paolo > enabling apic > paging enabled > cr0 = 80010011 > cr3 = 7fff000 > cr4 = 20 > PASS: test vmxon with FEATURE_CONTROL cleared > PASS: test vmxon without FEATURE_CONTROL lock > PASS: test enable VMX in FEATURE_CONTROL > PASS: test FEATURE_CONTROL lock bit > PASS: test vmxon > FAIL: test vmptrld > PASS: test vmclear > init_vmcs : make_vmcs_current error > FAIL: test vmptrst > init_vmcs : make_vmcs_current error > vmx_run : vmlaunch failed. > FAIL: test vmlaunch > FAIL: test vmlaunch > > SUMMARY: 10 tests, 4 unexpected failures > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 04/07/2014 09:39, Wanpeng Li ha scritto: > PASS: test vmxon with FEATURE_CONTROL cleared > PASS: test vmxon without FEATURE_CONTROL lock > PASS: test enable VMX in FEATURE_CONTROL > PASS: test FEATURE_CONTROL lock bit > PASS: test vmxon > PASS: test vmptrld > PASS: test vmclear > PASS: test vmptrst > PASS: test vmxoff You are not running the latest versions of the tests. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 04, 2014 at 09:46:38AM +0200, Paolo Bonzini wrote: >Il 04/07/2014 09:39, Wanpeng Li ha scritto: >>PASS: test vmxon with FEATURE_CONTROL cleared >>PASS: test vmxon without FEATURE_CONTROL lock >>PASS: test enable VMX in FEATURE_CONTROL >>PASS: test FEATURE_CONTROL lock bit >>PASS: test vmxon >>PASS: test vmptrld >>PASS: test vmclear >>PASS: test vmptrst >>PASS: test vmxoff > >You are not running the latest versions of the tests. > The last commit in my tree is commit daeec9795d3e6d4e9636588b6cb5fcd6e00d6d46 Author: Bandan Das <bsd@redhat.com> Date: Mon Jun 9 17:04:54 2014 -0400 VMX: Updated test_vmclear and test_vmptrld >Paolo >-- >To unsubscribe from this list: send the line "unsubscribe kvm" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 04/07/2014 09:59, Wanpeng Li ha scritto: >> >You are not running the latest versions of the tests. >> > > The last commit in my tree is > > commit daeec9795d3e6d4e9636588b6cb5fcd6e00d6d46 > Author: Bandan Das <bsd@redhat.com> > Date: Mon Jun 9 17:04:54 2014 -0400 > > VMX: Updated test_vmclear and test_vmptrld Can you try recompiling x86/vmx.*? Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 04, 2014 at 10:14:34AM +0200, Paolo Bonzini wrote: >Il 04/07/2014 09:59, Wanpeng Li ha scritto: >>>>You are not running the latest versions of the tests. >>>> >>The last commit in my tree is >> >>commit daeec9795d3e6d4e9636588b6cb5fcd6e00d6d46 >>Author: Bandan Das <bsd@redhat.com> >>Date: Mon Jun 9 17:04:54 2014 -0400 >> >> VMX: Updated test_vmclear and test_vmptrld > >Can you try recompiling x86/vmx.*? > My fault, you are right. Regards, Wanpeng Li >Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2014-07-04 07:43, Jan Kiszka wrote:
> All tests up to
> 307621765a are running fine here, but since a0e30e712d not much is
> completing successfully anymore:
>
> enabling apic
> paging enabled
> cr0 = 80010011
> cr3 = 7fff000
> cr4 = 20
> PASS: test vmxon with FEATURE_CONTROL cleared
> PASS: test vmxon without FEATURE_CONTROL lock
> PASS: test enable VMX in FEATURE_CONTROL
> PASS: test FEATURE_CONTROL lock bit
> PASS: test vmxon
> FAIL: test vmptrld
> PASS: test vmclear
> init_vmcs : make_vmcs_current error
> FAIL: test vmptrst
> init_vmcs : make_vmcs_current error
> vmx_run : vmlaunch failed.
> FAIL: test vmlaunch
> FAIL: test vmlaunch
>
> SUMMARY: 10 tests, 4 unexpected failures

Here is the reason for my failures:

000000000000010f <make_vmcs_current>:
 10f:	48 89 7c 24 f8       	mov    %rdi,-0x8(%rsp)
 114:	9c                   	pushfq
 115:	58                   	pop    %rax
 116:	48 83 c8 41          	or     $0x41,%rax
 11a:	50                   	push   %rax
 11b:	9d                   	popfq
 11c:	0f c7 74 24 f8       	vmptrld -0x8(%rsp)
 121:	0f 96 c0             	setbe  %al
 124:	0f b6 c0             	movzbl %al,%eax
 127:	c3                   	retq

The compiler is not aware of the fact that push/pop exists in this
function and, thus, places the vmcs parameter on the stack without
reserving the space. So the pushfq will overwrite the vmcs pointer and
let the function fail.

Jan
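For readers without the unit-test sources at hand, the C behind this listing looks roughly like the helper below. It is reconstructed from the objdump output above rather than copied from x86/vmx.c, so the exact operand constraints and comments are assumptions; the point is that the inline asm pushes onto the stack while GCC has parked the vmcs argument in the red zone at -0x8(%rsp).

/* Hedged reconstruction from the disassembly above, not the verbatim
 * kvm-unit-tests source.  RFLAGS.CF|RFLAGS.ZF (0x41) are set before
 * VMPTRLD so that "setbe" can report success/failure, but the pushfq
 * overwrites -0x8(%rsp), i.e. the red-zone slot where the compiler
 * spilled the vmcs argument that the "m" operand refers to. */
static int make_vmcs_current(struct vmcs *vmcs)
{
	bool ret;

	asm volatile ("pushfq\n\t"
		      "pop %%rax\n\t"
		      "or $0x41, %%rax\n\t"	/* pre-set CF and ZF */
		      "push %%rax\n\t"
		      "popfq\n\t"
		      "vmptrld %1\n\t"
		      "setbe %0"
		      : "=q" (ret)
		      : "m" (vmcs)
		      : "rax", "cc");
	return ret;
}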
Il 04/07/2014 11:33, Jan Kiszka ha scritto: > > The compiler is not aware of the fact that push/pop exists in this > function and, thus, places the vmcs parameter on the stack without > reserving the space. So the pushfq will overwrite the vmcs pointer and > let the function fail. Is that just a missing "memory" clobber? push/pop clobbers memory. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
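Concretely, the clobber Paolo asks about would just be appended to the clobber list of the helper reconstructed above, as sketched below (an illustration, not the real source). As the next message explains, it does not cure this failure: a "memory" clobber makes GCC treat visible memory as modified across the asm, but GCC still assumes its private spill slot in the red zone is untouched.

/* Same reconstructed helper, now with a "memory" clobber.  Still unsafe
 * with the red zone enabled: the spill slot at -0x8(%rsp) is not protected
 * by the clobber, so the pushfq can still overwrite the vmcs argument. */
static int make_vmcs_current_with_clobber(struct vmcs *vmcs)
{
	bool ret;

	asm volatile ("pushfq; pop %%rax; or $0x41, %%rax; push %%rax; popfq; "
		      "vmptrld %1; setbe %0"
		      : "=q" (ret)
		      : "m" (vmcs)
		      : "rax", "cc", "memory");
	return ret;
}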
On 2014-07-04 11:38, Paolo Bonzini wrote:
> Il 04/07/2014 11:33, Jan Kiszka ha scritto:
>>
>> The compiler is not aware of the fact that push/pop exists in this
>> function and, thus, places the vmcs parameter on the stack without
>> reserving the space. So the pushfq will overwrite the vmcs pointer and
>> let the function fail.
>
> Is that just a missing "memory" clobber? push/pop clobbers memory.

Nope, we would need some clobber like "stack". I wonder what is
required to use push in inline assembly safely?

Jan
On 2014-07-04 12:52, Jan Kiszka wrote: > On 2014-07-04 11:38, Paolo Bonzini wrote: >> Il 04/07/2014 11:33, Jan Kiszka ha scritto: >>> >>> The compiler is not aware of the fact that push/pop exists in this >>> function and, thus, places the vmcs parameter on the stack without >>> reserving the space. So the pushfq will overwrite the vmcs pointer and >>> let the function fail. >> >> Is that just a missing "memory" clobber? push/pop clobbers memory. > > Nope, we would needs some clobber like "stack". I wonder what is > required to use push in inline assembly safely? My colleague just found the answer: -mno-red-zone is required for 64-bit in order to play freely with the stack (or you need to stay off that zone, apparently some 128 bytes below the stack pointer). The kernel sets that switch, our unit tests do not. Jan
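To make that concrete: the SysV x86-64 ABI lets the compiler use the 128 bytes below %rsp (the red zone) without adjusting %rsp, so inline asm that pushes must either be built with -mno-red-zone, as the kernel is, or explicitly step over that area first. The helper below sketches the second option with a made-up name; it is not the change that went into kvm-unit-tests.

/* Sketch: read RFLAGS via push/pop without corrupting red-zone spills.
 * Assumes the SysV x86-64 ABI; 128 bytes is the ABI's red-zone size.
 * The simpler alternative is compiling the object with -mno-red-zone. */
static inline unsigned long rflags_read_redzone_safe(void)
{
	unsigned long flags;

	asm volatile ("lea -128(%%rsp), %%rsp\n\t"	/* hop over the red zone */
		      "pushfq\n\t"
		      "pop %0\n\t"
		      "lea 128(%%rsp), %%rsp"		/* restore %rsp */
		      : "=r" (flags)
		      :
		      : "memory");
	return flags;
}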
Il 04/07/2014 13:07, Jan Kiszka ha scritto: > On 2014-07-04 12:52, Jan Kiszka wrote: >> On 2014-07-04 11:38, Paolo Bonzini wrote: >>> Il 04/07/2014 11:33, Jan Kiszka ha scritto: >>>> >>>> The compiler is not aware of the fact that push/pop exists in this >>>> function and, thus, places the vmcs parameter on the stack without >>>> reserving the space. So the pushfq will overwrite the vmcs pointer and >>>> let the function fail. >>> >>> Is that just a missing "memory" clobber? push/pop clobbers memory. >> >> Nope, we would needs some clobber like "stack". I wonder what is >> required to use push in inline assembly safely? > > My colleague just found the answer: -mno-red-zone is required for 64-bit > in order to play freely with the stack (or you need to stay off that > zone, apparently some 128 bytes below the stack pointer). The kernel > sets that switch, our unit tests do not. Are you posting a patch? (Also, what compiler is that?) Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f4e5aed..fe69c49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -372,6 +372,7 @@ struct nested_vmx {
 	u64 vmcs01_tsc_offset;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	bool l1_events_blocked;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
@@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	 * we did not inject a still-pending event to L1 now because of
 	 * nested_run_pending, we need to re-enable this bit.
 	 */
-	if (vmx->nested.nested_run_pending)
+	if (to_vmx(vcpu)->nested.l1_events_blocked) {
+		to_vmx(vcpu)->nested.l1_events_blocked = false;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	}
 
 	vmx->nested.nested_run_pending = 0;
 
@@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) &&
 	    vmx->nested.preemption_timer_expired) {
-		if (vmx->nested.nested_run_pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
+		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0);
 		return 0;
 	}
 
 	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
-		if (vmx->nested.nested_run_pending ||
-		    vcpu->arch.interrupt.pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
+			return -EBUSY;
+		}
+		if (vcpu->arch.interrupt.pending)
 			return -EBUSY;
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
 				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
@@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
 	    nested_exit_on_intr(vcpu)) {
-		if (vmx->nested.nested_run_pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
+		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
 	}
This patch fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=72381

If we did not inject a still-pending event to L1 because of nested_run_pending, KVM_REQ_EVENT should be requested after the vmexit in order to inject the event to L1. However, the current logic blindly requests KVM_REQ_EVENT even if there is no still-pending event to L1 blocked by nested_run_pending. This leaves a race window: if L0 sends an interrupt to L1 during it, the interrupt that belongs to L1 can be injected into L2.

VCPU0                                          another thread

L1 intr not blocked on L2 first entry
vmx_vcpu_run req event
kvm check request req event
check_nested_events don't have any intr
not nested exit
                                               intr occur (8254, lapic timer etc)
inject_pending_event now have intr
inject interrupt

This patch fixes the race by introducing an l1_events_blocked field in nested_vmx which indicates that a still-pending event was blocked by nested_run_pending, and by requesting KVM_REQ_EVENT only when such a blocked event exists.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
---
 arch/x86/kvm/vmx.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)