[RFC,0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM

Message ID 20210421150831.60133-1-kentaishiguro@sslab.ics.keio.ac.jp (mailing list archive)

Message

Kenta Ishiguro April 21, 2021, 3:08 p.m. UTC
Dear KVM developers and maintainers,

In our research work presented last week at the VEE 2021 conference [1], we
found that many consecutive Pause-Loop-Exiting (PLE) events occur due to
three problems we have identified: 1) Linux CFS ignores hints from KVM;
2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI receiver
that has halted is always a candidate for boost.  We have introduced two
mitigations against these problems.

To solve problem (1), patch 1 increases the vruntime of the yielding vCPU
so that it passes the check `if (cfs_rq->next &&
wakeup_preempt_entity(cfs_rq->next, left) < 1)` in `struct sched_entity *
pick_next_entity()` when the cfs_rq's skip and next entities are both
vCPUs of the same VM. To keep fairness it does not prioritize the guest
VM that causes PLE; nevertheless, it improves performance by eliminating
unnecessary PLE. We have also confirmed that `yield_to_task_fair` is
called only from KVM.
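
For illustration, here is a minimal sketch of the idea (the helper
same_vm_vcpus() is a hypothetical stand-in for however the patch detects
that two entities back vCPU threads of the same VM; the actual patch
differs in detail):

/*
 * Sketch only: push the yielding vCPU's vruntime forward so that
 * pick_next_entity()'s
 *     if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 * selects the boost target instead of re-picking the spinning vCPU.
 */
static void align_vruntime_for_yield(struct cfs_rq *cfs_rq)
{
    struct sched_entity *skip = cfs_rq->skip;
    struct sched_entity *next = cfs_rq->next;

    if (!skip || !next || !same_vm_vcpus(skip, next))
        return;

    /*
     * Only the yielding entity moves, and only forward; entities of
     * other VMs keep their relative order, so inter-VM fairness holds.
     */
    if (entity_before(skip, next))
        skip->vruntime = next->vruntime;
}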

To solve problems (2) and (3), patch 2 monitors IPI communication between
vCPUs and leverages the relationship between vCPUs to select boost
candidates.  The "[PATCH] KVM: Boost vCPU candidate in user mode which is
delivering interrupt" patch
(https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
appears to be effective for (2), although it uses only the IPI-receiver
information.
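
As a rough sketch of what patch 2 does (field and helper names below are
illustrative assumptions, not the actual patch code):

/*
 * Sketch only: record the sender->receiver relationship at IPI delivery
 * time so that kvm_vcpu_on_spin() can later prefer the receiver, even if
 * it was preempted in user mode (problem 2), while halted receivers can
 * be filtered out (problem 3). The field ipi_sender_id is hypothetical.
 */
static void track_ipi_relationship(struct kvm_lapic *apic, u32 dest_id)
{
    struct kvm_vcpu *dest = kvm_get_vcpu_by_id(apic->vcpu->kvm, dest_id);

    if (dest)
        WRITE_ONCE(dest->ipi_sender_id, apic->vcpu->vcpu_id);
}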

Our approach reduces the total number of PLE events by up to 87.6% across
four 8-vCPU VMs in an over-subscribed scenario with Linux kernel 5.6.0.
Please find the patches below.

We would greatly appreciate your valuable feedback on our approach and
patch.

Thank you very much for your consideration.
Kenta Ishiguro

[1] Kenta Ishiguro, Naoki Yasuno, Pierre-Louis Aublin, and Kenji Kono.
    "Mitigating excessive vCPU spinning in VM-agnostic KVM".
    In Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference
    on Virtual Execution Environments (VEE 2021).
    Association for Computing Machinery, New York,
    NY, USA, 139--152.  https://dl.acm.org/doi/abs/10.1145/3453933.3454020

Kenta Ishiguro (2):
  Prevent CFS from ignoring boost requests from KVM
  Boost vCPUs based on IPI-sender and receiver information

 arch/x86/kvm/lapic.c     | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c   |  2 ++
 include/linux/kvm_host.h |  5 +++++
 kernel/sched/fair.c      | 31 +++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++--
 5 files changed, 76 insertions(+), 2 deletions(-)

Comments

Sean Christopherson April 21, 2021, 4:19 p.m. UTC | #1
On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates.  The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> appears to be effective for (2), although it uses only the IPI-receiver
> information.

On the IPI side of things, I like the idea of explicitly tracking the IPIs,
especially if we can simplify the implementation, e.g. by losing the receiver
info and making ipi_received a bool.  Maybe temporarily table Wanpeng's patch
while this approach is analyzed?
Wanpeng Li April 22, 2021, 12:55 a.m. UTC | #2
On Thu, 22 Apr 2021 at 06:13, Kenta Ishiguro
<kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Dear KVM developers and maintainers,
>
> In our research work presented last week at the VEE 2021 conference [1], we
> found that many consecutive Pause-Loop-Exiting (PLE) events occur due to
> three problems we have identified: 1) Linux CFS ignores hints from KVM;
> 2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI receiver
> that has halted is always a candidate for boost.  We have introduced two
> mitigations against these problems.
>
> To solve problem (1), patch 1 increases the vruntime of the yielding vCPU
> so that it passes the check `if (cfs_rq->next &&
> wakeup_preempt_entity(cfs_rq->next, left) < 1)` in `struct sched_entity *
> pick_next_entity()` when the cfs_rq's skip and next entities are both
> vCPUs of the same VM. To keep fairness it does not prioritize the guest
> VM that causes PLE; nevertheless, it improves performance by eliminating
> unnecessary PLE. We have also confirmed that `yield_to_task_fair` is
> called only from KVM.
>
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates.  The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> appears to be effective for (2), although it uses only the IPI-receiver
> information.
>
> Our approach reduces the total number of PLE events by up to 87.6% across
> four 8-vCPU VMs in an over-subscribed scenario with Linux kernel 5.6.0.
> Please find the patches below.

You should mention that this improvement mainly comes from the scheduler
hacking for your problem (1); however, a KVM task is just an ordinary
task, and the scheduler maintainers generally do not accept special
treatment for it.  As for the worst case of problem (1) mentioned in your
paper, I guess it is the vCPU stacking issue, which I tried to mitigate
before
(https://lore.kernel.org/kvm/1564479235-25074-1-git-send-email-wanpengli@tencent.com/).
For your problem (3), we evaluated hackbench, which heavily contends rq
locks and generates heavy async IPIs (reschedule IPIs); the async IPI
influence is around 0.X%, so I don't expect normal workloads to feel any
effect. In addition, four 8-vCPU VMs are not suitable for a scalability
evaluation. I don't think the complexity introduced by your patch 2 is
worth it, since it achieves a similar effect to my version with the
current heuristic algorithm.

    Wanpeng
Wanpeng Li April 22, 2021, 12:21 p.m. UTC | #3
On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > vCPUs and leverages the relationship between vCPUs to select boost
> > candidates.  The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> > delivering interrupt" patch
> > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > appears to be effective for (2), although it uses only the IPI-receiver
> > information.
>
> On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> especially if we can simplify the implementation, e.g. by losing the receiver
> info and making ipi_received a bool.  Maybe temporarily table Wanpeng's patch
> while this approach is analyzed?

Hi all,

I evaluated my patch
(https://lore.kernel.org/kvm/1618542490-14756-1-git-send-email-wanpengli@tencent.com),
Kenta's patch 2, and Sean's suggestion. The testing environment is
pbzip2 in a 96-vCPU VM in an over-subscribed scenario (the host machine
is a 2-socket, 48-core, 96-HT Intel CLX box). Note: Kenta's scheduler
hacking is not applied. My patch's score is the most stable and the
best performing.

Wanpeng's patch

The average: vanilla -> boost: 69.124 -> 61.975, 10.3%

* Wall Clock: 61.695359 seconds
* Wall Clock: 63.343579 seconds
* Wall Clock: 61.567513 seconds
* Wall Clock: 62.144722 seconds
* Wall Clock: 61.091442 seconds
* Wall Clock: 62.085912 seconds
* Wall Clock: 61.311954 seconds

Kenta's patch

The average: vanilla -> boost: 69.148 -> 64.567, 6.6%

* Wall Clock:  66.288113 seconds
* Wall Clock:  61.228642 seconds
* Wall Clock:  62.100524 seconds
* Wall Clock:  68.355473 seconds
* Wall Clock:  64.864608 seconds

Sean's suggestion:

The average: vanilla -> boost: 69.148 -> 66.505, 3.8%

* Wall Clock: 60.583562 seconds
* Wall Clock: 58.533960 seconds
* Wall Clock: 70.103489 seconds
* Wall Clock: 74.279028 seconds
* Wall Clock: 69.024194 seconds

I followed (almost exactly) Sean's suggestion:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0050f39..78b5eb6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1272,6 +1272,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
 void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 {
     struct kvm_lapic_irq irq;
+    struct kvm_vcpu *dest_vcpu;

     irq.vector = icr_low & APIC_VECTOR_MASK;
     irq.delivery_mode = icr_low & APIC_MODE_MASK;
@@ -1285,6 +1286,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
     else
         irq.dest_id = GET_APIC_DEST_FIELD(icr_high);

+    dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq.dest_id);
+    if (dest_vcpu)
+        WRITE_ONCE(dest_vcpu->ipi_received, true);
+
     trace_kvm_apic_ipi(icr_low, irq.dest_id);

     kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 303fb55..a98bf571 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9298,6 +9298,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
     if (test_thread_flag(TIF_NEED_FPU_LOAD))
         switch_fpu_return();

+    WRITE_ONCE(vcpu->ipi_received, false);
+
     if (unlikely(vcpu->arch.switch_db_regs)) {
         set_debugreg(0, 7);
         set_debugreg(vcpu->arch.eff_db[0], 0);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5ef09a4..81e39fa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,8 @@ struct kvm_vcpu {
         bool dy_eligible;
     } spin_loop;
 #endif
+
+    bool ipi_received;
     bool preempted;
     bool ready;
     struct kvm_vcpu_arch arch;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c682f82..5098929 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -411,6 +411,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)

     kvm_vcpu_set_in_spin_loop(vcpu, false);
     kvm_vcpu_set_dy_eligible(vcpu, false);
+    vcpu->ipi_received = false;
     vcpu->preempted = false;
     vcpu->ready = false;
     preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
@@ -3220,6 +3221,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
                 !vcpu_dy_runnable(vcpu))
                 continue;
             if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+                !READ_ONCE(vcpu->ipi_received) &&
                 !kvm_arch_vcpu_in_kernel(vcpu))
                 continue;
             if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
Sean Christopherson April 22, 2021, 3:58 p.m. UTC | #4
On Thu, Apr 22, 2021, Wanpeng Li wrote:
> On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > > vCPUs and leverages the relationship between vCPUs to select boost
> > > candidates.  The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> > > delivering interrupt" patch
> > > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > > appears to be effective for (2), although it uses only the IPI-receiver
> > > information.
> >
> > On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> > especially if we can simplify the implementation, e.g. by losing the receiver
> > info and making ipi_received a bool.  Maybe temporarily table Wanpeng's patch
> > while this approach is analyzed?
> 
> Hi all,
> 
> I evaluate my patch

Thanks for doing the testing, much appreciated!

> (https://lore.kernel.org/kvm/1618542490-14756-1-git-send-email-wanpengli@tencent.com),
> Kenta's patch 2, and Sean's suggestion. The testing environment is
> pbzip2 in a 96-vCPU VM in an over-subscribed scenario (the host machine
> is a 2-socket, 48-core, 96-HT Intel CLX box).

Are the vCPUs affined in any way?  How many VMs are running?  Are there other
workloads on the host?  Not criticising, just asking so that others can
reproduce your setup.

> Note: Kenta's scheduler hacking is not applied. My patch's score is the
> most stable and the best performing.

On the other hand, Kenta's approach has the advantage of working for both Intel
and AMD.  But I'm also not very familiar with AMD's AVIC, so I don't know if it's
feasible to implement a performant equivalent in svm_dy_apicv_has_pending_interrupt().

Kenta's patch is also flawed as it doesn't scale to 96 vCPUs; vCPUs 64-95 will
never get boosted.
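
A plausible reading of that limit (a hypothetical reconstruction, not the
patch's actual code): tracking IPI receivers in a fixed-width 64-bit
bitmap can only mark vcpu_ids 0-63.

u64 ipi_receiver_map;                 /* one bit per vcpu_id */

/* The shift is undefined for dest_id >= 64, so vCPUs 64-95 of a
 * 96-vCPU guest are never marked and thus never boosted. */
ipi_receiver_map |= 1ULL << dest_id;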

> Wanpeng's patch
> 
> The average: vanilla -> boost: 69.124 -> 61.975, 10.3%
> 
> * Wall Clock: 61.695359 seconds
> * Wall Clock: 63.343579 seconds
> * Wall Clock: 61.567513 seconds
> * Wall Clock: 62.144722 seconds
> * Wall Clock: 61.091442 seconds
> * Wall Clock: 62.085912 seconds
> * Wall Clock: 61.311954 seconds
> 
> Kenta's patch
> 
> The average: vanilla -> boost: 69.148 -> 64.567, 6.6%
> 
> * Wall Clock:  66.288113 seconds
> * Wall Clock:  61.228642 seconds
> * Wall Clock:  62.100524 seconds
> * Wall Clock:  68.355473 seconds
> * Wall Clock:  64.864608 seconds
> 
> Sean's suggestion:
> 
> The average: vanilla -> boost: 69.148 -> 66.505, 3.8%
> 
> * Wall Clock: 60.583562 seconds
> * Wall Clock: 58.533960 seconds
> * Wall Clock: 70.103489 seconds
> * Wall Clock: 74.279028 seconds
> * Wall Clock: 69.024194 seconds
Wanpeng Li April 26, 2021, 3:02 a.m. UTC | #5
On Mon, 26 Apr 2021 at 10:56, Kenta Ishiguro
<kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Dear all,
>
> Thank you for the insightful feedback.
>
> Does Sean's suggested version of Wanpeng's patch mark a running vCPU as an
> IPI receiver? If so, I think the candidate set of vCPUs for boosting
> differs slightly between using kvm_arch_interrupt_delivery and using the
> boolean ipi_received. In the version using the boolean ipi_received, vCPUs
> that receive an IPI while running are also candidates for a boost.
> However, they have likely already responded to their IPI before they exit.

             if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+                !READ_ONCE(vcpu->ipi_received) &&

There is a vcpu->preempted check here.

    Wanpeng
Kenta Ishiguro April 26, 2021, 3:18 a.m. UTC | #6
Thank you for the reply.

My question is about the following scenario:
1. a running vCPU receives an IPI, and the vCPU's ipi_received becomes true
2. the vCPU responds to the IPI
3. the vCPU exits
4. the vCPU is preempted by KVM
5. the vCPU is boosted, but it has already responded to the IPI
6. the vCPU re-enters the guest and its ipi_received is cleared

In this case, I think the vcpu->preempted check does not limit the
candidate vCPUs.
Wanpeng Li April 26, 2021, 3:58 a.m. UTC | #7
On Mon, 26 Apr 2021 at 11:19, Kenta Ishiguro
<kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Thank you for the reply.
>
> My question is about the following scenario:
> 1. a running vCPU receives an IPI, and the vCPU's ipi_received becomes true
> 2. the vCPU responds to the IPI
> 3. the vCPU exits
> 4. the vCPU is preempted by KVM
> 5. the vCPU is boosted, but it has already responded to the IPI
> 6. the vCPU re-enters the guest and its ipi_received is cleared
>
> In this case, I think the vcpu->preempted check does not limit the
> candidate vCPUs.

Good point, you are right. However, I actually played with that code a
bit before; I have another version that adds a vcpu->preempted check
when marking the IPI receiver, and its score is not as good as expected.

    Wanpeng
Kenta Ishiguro April 26, 2021, 4:10 a.m. UTC | #8
I see. Thank you!