
[RFC] x86/apicv: fix RTC periodic timer and apicv issue

Message ID E0A769A898ADB6449596C41F51EF62C6AA3E8E@SZXEMI506-MBX.china.huawei.com (mailing list archive)
State New, archived
Headers show

Commit Message

Xuquan (Euler) Sept. 9, 2016, 3:02 a.m. UTC
On August 30, 2016 1:58 PM, Tian Kevin < kevin.tian@intel.com > wrote:
>> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
>> Sent: Friday, August 19, 2016 8:59 PM
>>
>> When Xen apicv is enabled, wall clock time is faster on a Windows7-32
>> guest under high load (with 2 vCPUs; captured from xentrace, under high
>> load the count of IPI interrupts between these vCPUs increases rapidly).
>>
>> If an IPI interrupt (vector 0xe1) and a periodic timer interrupt (vector
>> 0xd1) are both pending (their bits set in VIRR), the IPI interrupt
>> unfortunately has a higher priority than the periodic timer interrupt.
>> Xen updates the IPI interrupt bit set in VIRR to the guest interrupt
>> status (RVI) as the higher priority, and apicv (Virtual-Interrupt
>> Delivery) delivers the IPI interrupt within VMX non-root operation
>> without a VM exit. Within VMX non-root operation, if the periodic timer
>> interrupt bit is set in VIRR and is the highest, apicv delivers the
>> periodic timer interrupt within VMX non-root operation as well.
>>
>> But in the current code, if Xen doesn't update the periodic timer
>> interrupt bit set in VIRR to the guest interrupt status (RVI) directly,
>> Xen is not aware of this case and does not decrease the count
>> (pending_intr_nr) of pending periodic timer interrupts, so Xen will
>> deliver a periodic timer interrupt again. The guest therefore receives
>> extra periodic timer interrupts.
>>
>> If the periodic timer interrupt is delivered but is not the highest
>> priority, make Xen aware of this case so that it decreases the count of
>> pending periodic timer interrupts.
>>
>> Signed-off-by: Yifei Jiang <jiangyifei@huawei.com>
>> Signed-off-by: Rongguang He <herongguang.he@huawei.com>
>> Signed-off-by: Quan Xu <xuquan8@huawei.com>
>> ---
>> Why RFC:
>> 1. I am not quite sure about other cases, such as the nested case.
>> 2. Within VMX non-root operation, an Asynchronous Enclave Exit (including
>>    external interrupts, non-maskable interrupts, system-management
>>    interrupts, exceptions and VM exits) may occur before delivery of a
>>    periodic timer interrupt, so the periodic timer interrupt may be lost
>>    when the next periodic timer interrupt is delivered. The current code
>>    behaves this way as well.
>> ---
>>  xen/arch/x86/hvm/vmx/intr.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
>> index 8fca08c..d3a034e 100644
>> --- a/xen/arch/x86/hvm/vmx/intr.c
>> +++ b/xen/arch/x86/hvm/vmx/intr.c
>> @@ -334,7 +334,21 @@ void vmx_intr_assist(void)
>>              __vmwrite(EOI_EXIT_BITMAP(i), v->arch.hvm_vmx.eoi_exit_bitmap[i]);
>>          }
>>
>> -        pt_intr_post(v, intack);
>> +        /*
>> +         * If the periodic timer interrupt is delivered and is not the
>> +         * highest priority, make Xen aware of this case so that it
>> +         * decreases the count of pending periodic timer interrupts.
>> +         */
>> +        if ( pt_vector != -1 && intack.vector > pt_vector )
>> +        {
>> +            struct hvm_intack pt_intack;
>> +
>> +            pt_intack.vector = pt_vector;
>> +            pt_intack.source = hvm_intsrc_lapic;
>> +            pt_intr_post(v, pt_intack);
>> +        }
>> +        else
>> +            pt_intr_post(v, intack);
>>      }
>
>Here is my thought: w/o the above patch, one pending pt interrupt may result in
>multiple injections of the guest timer interrupt, as you identified, when pt_vector
>happens not to be the highest pending vector. Then w/ your patch, multiple
>pending pt interrupt instances may be combined into one injection of the guest
>timer interrupt. Especially, intack.vector > pt_vector doesn't mean pt_intr is
>eligible for injection, which might be blocked for other reasons. As Jan pointed
>out, this path is very fragile, so better we have a fix that sustains the original
>behavior of one pending pt instance causing one actual injection.
>
>What about doing pt_intr_post in the EOI-induced VM-exit instead? The EOI exit
>bit for pt_intr is already set to 1, which means every injection would incur an
>EOI-induced VM-exit:
>
>        /*
>         * Set eoi_exit_bitmap for periodic timer interrupt to cause EOI-induced
>         * VM exit, then pending periodic timer interrupts have the chance to be
>         * injected for compensation
>         */
>        if (pt_vector != -1)
>            vmx_set_eoi_exit_bitmap(v, pt_vector);
>
>I don't think delaying the post makes much difference here. Even if we post
>earlier, as in your patch, further pending instances have to wait until the
>current instance is completed. So as long as the post happens before the EOI
>is completed, it should be fine.

Kevin, I verified your suggestion with the modification below. Wall clock time is __still__ faster. I hope this modification correctly reflects your suggestion.

I __guess__ that VM entry is more frequent than the PT interrupt.
Specifically, if there is a PT interrupt pending, the count (pending_intr_nr) is 1.

Before the next PT interrupt comes, each PT interrupt injection may not directly incur an EOI-induced VM-exit, so multiple PT interrupt injections happen for one pending PT interrupt.
Multiple EOI-induced VM-exits then occur, with multiple pt_intr_post() calls to decrease the count (pending_intr_nr); but the count (pending_intr_nr) is still 1 before the next PT interrupt comes, so only one pt_intr_post() is effective.

Thanks for your time!!
Quan
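
As background for the vector numbers in the commit message above: the local APIC orders fixed interrupts by priority class, the upper four bits of the vector, which is why the IPI (0xe1) is delivered ahead of the periodic timer interrupt (0xd1). Below is a minimal standalone illustration of that ordering; it is demo code only, not Xen code.

/*
 * Standalone illustration (not Xen code): the APIC priority class of a
 * fixed interrupt is vector[7:4], so vector 0xe1 (the IPI) outranks
 * vector 0xd1 (the periodic timer interrupt) and is injected first,
 * leaving the timer vector pending in VIRR.
 */
#include <stdio.h>

static unsigned int priority_class(unsigned int vector)
{
    return vector >> 4;
}

int main(void)
{
    unsigned int ipi = 0xe1, pt = 0xd1;

    printf("IPI class %x, PT class %x -> %s is delivered first\n",
           priority_class(ipi), priority_class(pt),
           priority_class(ipi) > priority_class(pt) ? "IPI" : "PT");
    return 0;
}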

Comments

Tian, Kevin Sept. 12, 2016, 7:57 a.m. UTC | #1
> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
> Sent: Friday, September 09, 2016 11:02 AM
> 
> On August 30, 2016 1:58 PM, Tian Kevin < kevin.tian@intel.com > wrote:
> >> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
> >> Sent: Friday, August 19, 2016 8:59 PM
> >>
> >> When Xen apicv is enabled, wall clock time is faster on a Windows7-32
> >> guest under high load (with 2 vCPUs; captured from xentrace, under high
> >> load the count of IPI interrupts between these vCPUs increases rapidly).
> >>
> >> If an IPI interrupt (vector 0xe1) and a periodic timer interrupt (vector
> >> 0xd1) are both pending (their bits set in VIRR), the IPI interrupt
> >> unfortunately has a higher priority than the periodic timer interrupt.
> >> Xen updates the IPI interrupt bit set in VIRR to the guest interrupt
> >> status (RVI) as the higher priority, and apicv (Virtual-Interrupt
> >> Delivery) delivers the IPI interrupt within VMX non-root operation
> >> without a VM exit. Within VMX non-root operation, if the periodic timer
> >> interrupt bit is set in VIRR and is the highest, apicv delivers the
> >> periodic timer interrupt within VMX non-root operation as well.
> >>
> >> But in the current code, if Xen doesn't update the periodic timer
> >> interrupt bit set in VIRR to the guest interrupt status (RVI) directly,
> >> Xen is not aware of this case and does not decrease the count
> >> (pending_intr_nr) of pending periodic timer interrupts, so Xen will
> >> deliver a periodic timer interrupt again. The guest therefore receives
> >> extra periodic timer interrupts.
> >>
> >> If the periodic timer interrupt is delivered but is not the highest
> >> priority, make Xen aware of this case so that it decreases the count of
> >> pending periodic timer interrupts.
> >>
> >> Signed-off-by: Yifei Jiang <jiangyifei@huawei.com>
> >> Signed-off-by: Rongguang He <herongguang.he@huawei.com>
> >> Signed-off-by: Quan Xu <xuquan8@huawei.com>
> >> ---
> >> Why RFC:
> >> 1. I am not quite sure about other cases, such as the nested case.
> >> 2. Within VMX non-root operation, an Asynchronous Enclave Exit (including
> >>    external interrupts, non-maskable interrupts, system-management
> >>    interrupts, exceptions and VM exits) may occur before delivery of a
> >>    periodic timer interrupt, so the periodic timer interrupt may be lost
> >>    when the next periodic timer interrupt is delivered. The current code
> >>    behaves this way as well.
> >> ---
> >>  xen/arch/x86/hvm/vmx/intr.c | 16 +++++++++++++++-
> >>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
> >> index 8fca08c..d3a034e 100644
> >> --- a/xen/arch/x86/hvm/vmx/intr.c
> >> +++ b/xen/arch/x86/hvm/vmx/intr.c
> >> @@ -334,7 +334,21 @@ void vmx_intr_assist(void)
> >>              __vmwrite(EOI_EXIT_BITMAP(i), v->arch.hvm_vmx.eoi_exit_bitmap[i]);
> >>          }
> >>
> >> -        pt_intr_post(v, intack);
> >> +        /*
> >> +         * If the periodic timer interrupt is delivered and is not the
> >> +         * highest priority, make Xen aware of this case so that it
> >> +         * decreases the count of pending periodic timer interrupts.
> >> +         */
> >> +        if ( pt_vector != -1 && intack.vector > pt_vector )
> >> +        {
> >> +            struct hvm_intack pt_intack;
> >> +
> >> +            pt_intack.vector = pt_vector;
> >> +            pt_intack.source = hvm_intsrc_lapic;
> >> +            pt_intr_post(v, pt_intack);
> >> +        }
> >> +        else
> >> +            pt_intr_post(v, intack);
> >>      }
> >
> >Here is my thought: w/o the above patch, one pending pt interrupt may result in
> >multiple injections of the guest timer interrupt, as you identified, when pt_vector
> >happens not to be the highest pending vector. Then w/ your patch, multiple
> >pending pt interrupt instances may be combined into one injection of the guest
> >timer interrupt. Especially, intack.vector > pt_vector doesn't mean pt_intr is
> >eligible for injection, which might be blocked for other reasons. As Jan pointed
> >out, this path is very fragile, so better we have a fix that sustains the original
> >behavior of one pending pt instance causing one actual injection.
> >
> >What about doing pt_intr_post in the EOI-induced VM-exit instead? The EOI exit
> >bit for pt_intr is already set to 1, which means every injection would incur an
> >EOI-induced VM-exit:
> >
> >        /*
> >         * Set eoi_exit_bitmap for periodic timer interrupt to cause EOI-induced
> >         * VM exit, then pending periodic timer interrupts have the chance to be
> >         * injected for compensation
> >         */
> >        if (pt_vector != -1)
> >            vmx_set_eoi_exit_bitmap(v, pt_vector);
> >
> >I don't think delaying the post makes much difference here. Even if we post
> >earlier, as in your patch, further pending instances have to wait until the
> >current instance is completed. So as long as the post happens before the EOI
> >is completed, it should be fine.
> 
> Kevin, I verified your suggestion with the modification below. Wall clock time is __still__ faster.
> I hope this modification correctly reflects your suggestion.
> 
> I __guess__ that VM entry is more frequent than the PT interrupt.
> Specifically, if there is a PT interrupt pending, the count (pending_intr_nr) is 1.
> 
> Before the next PT interrupt comes, each PT interrupt injection may not directly incur an EOI-induced
> VM-exit, so multiple PT interrupt injections happen for one pending PT interrupt.
> Multiple EOI-induced VM-exits then occur, with multiple pt_intr_post() calls to decrease the count
> (pending_intr_nr); but the count (pending_intr_nr) is still 1 before the next PT interrupt coming,
> so only one pt_intr_post() is effective.

I don't quite understand your description here, but regarding your patch, see below...

> 
> Thanks for your time!!
> Quan
> 
> ======= modification ========
> 
> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> index 1d5d287..cc247c3 100644
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -433,6 +433,11 @@ void vlapic_EOI_set(struct vlapic *vlapic)
>  void vlapic_handle_EOI(struct vlapic *vlapic, u8 vector)
>  {
>      struct domain *d = vlapic_domain(vlapic);
> +    struct hvm_intack pt_intack;
> +
> +    pt_intack.vector = vector;
> +    pt_intack.source = hvm_intsrc_lapic;
> +    pt_intr_post(vlapic_vcpu(vlapic), pt_intack);
> 
>      if ( vlapic_test_and_clear_vector(vector, &vlapic->regs->data[APIC_TMR]) )
>          vioapic_update_EOI(d, vector);
> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
> index 8fca08c..29d9bbf 100644
> --- a/xen/arch/x86/hvm/vmx/intr.c
> +++ b/xen/arch/x86/hvm/vmx/intr.c
> @@ -333,8 +333,6 @@ void vmx_intr_assist(void)
>              clear_bit(i, &v->arch.hvm_vmx.eoi_exitmap_changed);
>              __vmwrite(EOI_EXIT_BITMAP(i), v->arch.hvm_vmx.eoi_exit_bitmap[i]);
>          }
> -
> -        pt_intr_post(v, intack);
>      }
>      else
>      {
> 

Because we update the pt irq on every vmentry, there is a chance that
an already-injected instance (before the EOI-induced exit happens) will
incur another pending IRR setting if a VM exit happens between HW
virtual interrupt injection (vIRR->0, vISR->1) and the EOI-induced exit
(vISR->0), since pt_intr_post hasn't been invoked yet. I guess this is
the reason why you still see a faster wallclock.

I think you need to mark this pending_intr_post situation explicitly.
Then pt_update_irq should skip such a pt timer when its pending_intr_post
is true (otherwise the update is meaningless, since the previous one
hasn't been posted yet). Then, with your change to post in the
EOI-induced exit handler, it should work correctly to meet the goal:
one virtual interrupt delivery for one pending pt intr...

Thanks
Kevin
Xuquan (Euler) Sept. 12, 2016, 9:07 a.m. UTC | #2
On September 12, 2016 3:58 PM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
>> Sent: Friday, September 09, 2016 11:02 AM
>>
>> On August 30, 2016 1:58 PM, Tian Kevin < kevin.tian@intel.com > wrote:
>> >> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
>> >> Sent: Friday, August 19, 2016 8:59 PM
>> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
>> index 1d5d287..cc247c3 100644
>> --- a/xen/arch/x86/hvm/vlapic.c
>> +++ b/xen/arch/x86/hvm/vlapic.c
>> @@ -433,6 +433,11 @@ void vlapic_EOI_set(struct vlapic *vlapic)
>>  void vlapic_handle_EOI(struct vlapic *vlapic, u8 vector)
>>  {
>>      struct domain *d = vlapic_domain(vlapic);
>> +    struct hvm_intack pt_intack;
>> +
>> +    pt_intack.vector = vector;
>> +    pt_intack.source = hvm_intsrc_lapic;
>> +    pt_intr_post(vlapic_vcpu(vlapic), pt_intack);
>>
>>      if ( vlapic_test_and_clear_vector(vector, &vlapic->regs->data[APIC_TMR]) )
>>          vioapic_update_EOI(d, vector);
>> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
>> index 8fca08c..29d9bbf 100644
>> --- a/xen/arch/x86/hvm/vmx/intr.c
>> +++ b/xen/arch/x86/hvm/vmx/intr.c
>> @@ -333,8 +333,6 @@ void vmx_intr_assist(void)
>>              clear_bit(i, &v->arch.hvm_vmx.eoi_exitmap_changed);
>>              __vmwrite(EOI_EXIT_BITMAP(i), v->arch.hvm_vmx.eoi_exit_bitmap[i]);
>>          }
>> -
>> -        pt_intr_post(v, intack);
>>      }
>>      else
>>      {
>>
>
>Because we update the pt irq on every vmentry, there is a chance that
>an already-injected instance (before the EOI-induced exit happens) will incur
>another pending IRR setting if a VM exit happens between HW virtual interrupt
>injection (vIRR->0, vISR->1) and the EOI-induced exit (vISR->0), since
>pt_intr_post hasn't been invoked yet. I guess this is the reason why you still
>see a faster wallclock.
>

Agreed, that is a good description. My poor description approached it from another angle.

>I think you need to mark this pending_intr_post situation explicitly.
>Then pt_update_irq should skip such a pt timer when its pending_intr_post
>is true (otherwise the update is meaningless, since the previous one hasn't
>been posted yet). Then, with your change to post in the EOI-induced exit
>handler, it should work correctly to meet the goal:
>one virtual interrupt delivery for one pending pt intr...
>
I think we are at least on the right track.
But I can't follow 'pending_intr_post'; is it a new parameter? Thanks.


Quan
Tian, Kevin Sept. 19, 2016, 9:25 a.m. UTC | #3
> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
> Sent: Monday, September 12, 2016 5:08 PM
> 
> On September 12, 2016 3:58 PM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
> >> Sent: Friday, September 09, 2016 11:02 AM
> >>
> >> On August 30, 2016 1:58 PM, Tian Kevin < kevin.tian@intel.com > wrote:
> >> >> From: Xuquan (Euler) [mailto:xuquan8@huawei.com]
> >> >> Sent: Friday, August 19, 2016 8:59 PM
> >> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> >> index 1d5d287..cc247c3 100644
> >> --- a/xen/arch/x86/hvm/vlapic.c
> >> +++ b/xen/arch/x86/hvm/vlapic.c
> >> @@ -433,6 +433,11 @@ void vlapic_EOI_set(struct vlapic *vlapic)
> >>  void vlapic_handle_EOI(struct vlapic *vlapic, u8 vector)
> >>  {
> >>      struct domain *d = vlapic_domain(vlapic);
> >> +    struct hvm_intack pt_intack;
> >> +
> >> +    pt_intack.vector = vector;
> >> +    pt_intack.source = hvm_intsrc_lapic;
> >> +    pt_intr_post(vlapic_vcpu(vlapic), pt_intack);
> >>
> >>      if ( vlapic_test_and_clear_vector(vector, &vlapic->regs->data[APIC_TMR]) )
> >>          vioapic_update_EOI(d, vector);
> >> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
> >> index 8fca08c..29d9bbf 100644
> >> --- a/xen/arch/x86/hvm/vmx/intr.c
> >> +++ b/xen/arch/x86/hvm/vmx/intr.c
> >> @@ -333,8 +333,6 @@ void vmx_intr_assist(void)
> >>              clear_bit(i, &v->arch.hvm_vmx.eoi_exitmap_changed);
> >>              __vmwrite(EOI_EXIT_BITMAP(i), v->arch.hvm_vmx.eoi_exit_bitmap[i]);
> >>          }
> >> -
> >> -        pt_intr_post(v, intack);
> >>      }
> >>      else
> >>      {
> >>
> >
> >Because we update the pt irq on every vmentry, there is a chance that
> >an already-injected instance (before the EOI-induced exit happens) will incur
> >another pending IRR setting if a VM exit happens between HW virtual interrupt
> >injection (vIRR->0, vISR->1) and the EOI-induced exit (vISR->0), since
> >pt_intr_post hasn't been invoked yet. I guess this is the reason why you still
> >see a faster wallclock.
> >
> 
> Agreed. A good description. My bad description is from another aspect.
> 
> >I think you need to mark this pending_intr_post situation explicitly.
> >Then pt_update_irq should skip such a pt timer when its pending_intr_post
> >is true (otherwise the update is meaningless, since the previous one hasn't
> >been posted yet). Then, with your change to post in the EOI-induced exit
> >handler, it should work correctly to meet the goal:
> >one virtual interrupt delivery for one pending pt intr...
> >
> I think we are at least on the right track.
> But I can't follow 'pending_intr_post'; is it a new parameter? Thanks.
> 
> 

Yes, a new parameter to record whether an intr_post operation is pending.

Thanks
Kevin
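
To make the suggestion above concrete, here is a minimal standalone sketch of the "pending post" flag Kevin describes. The struct, field and helper names are illustrative assumptions, and the hook points are taken from this discussion rather than from a committed Xen patch; the idea is that pt_update_irq() would consult such a flag and refrain from re-raising the vector while an injected instance is still waiting for its EOI-induced exit.

/*
 * Sketch of the "pending post" flag suggested above.  Names and hook
 * points are assumptions for illustration; this is not the code that
 * was eventually committed to Xen.
 */
#include <stdbool.h>
#include <stdio.h>

struct pt_model {
    unsigned int pending_intr_nr;   /* ticks owed to the guest */
    bool pending_intr_post;         /* injected tick not yet posted at EOI */
};

/* pt_update_irq() would skip a timer whose last injection is still unposted,
 * so a VM exit between injection (vIRR->0, vISR->1) and the EOI-induced
 * exit (vISR->0) cannot set the IRR bit a second time. */
static bool pt_should_skip(const struct pt_model *pt)
{
    return pt->pending_intr_post;
}

/* Called when the pt vector is actually injected (vmx_intr_assist() path). */
static void pt_mark_injected(struct pt_model *pt)
{
    pt->pending_intr_post = true;
}

/* pt_intr_post(), now driven from the EOI-induced exit handler,
 * consumes exactly one pending instance and re-arms injection. */
static void pt_post(struct pt_model *pt)
{
    if ( pt->pending_intr_nr )
        pt->pending_intr_nr--;
    pt->pending_intr_post = false;
}

int main(void)
{
    struct pt_model pt = { .pending_intr_nr = 1 };

    pt_mark_injected(&pt);                                /* tick injected   */
    printf("skip re-inject: %d\n", pt_should_skip(&pt));  /* 1: skip         */
    pt_post(&pt);                                         /* EOI posts it    */
    printf("skip re-inject: %d\n", pt_should_skip(&pt));  /* 0: next tick ok */
    return 0;
}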

Patch

======= modification ========

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 1d5d287..cc247c3 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -433,6 +433,11 @@  void vlapic_EOI_set(struct vlapic *vlapic)
 void vlapic_handle_EOI(struct vlapic *vlapic, u8 vector)
 {
     struct domain *d = vlapic_domain(vlapic);
+    struct hvm_intack pt_intack;
+
+    pt_intack.vector = vector;
+    pt_intack.source = hvm_intsrc_lapic;
+    pt_intr_post(vlapic_vcpu(vlapic), pt_intack);

     if ( vlapic_test_and_clear_vector(vector, &vlapic->regs->data[APIC_TMR]) )
         vioapic_update_EOI(d, vector);
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index 8fca08c..29d9bbf 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -333,8 +333,6 @@  void vmx_intr_assist(void)
             clear_bit(i, &v->arch.hvm_vmx.eoi_exitmap_changed);
             __vmwrite(EOI_EXIT_BITMAP(i), v->arch.hvm_vmx.eoi_exit_bitmap[i]);
         }
-
-        pt_intr_post(v, intack);
     }
     else
     {