Message ID: 20210421150831.60133-1-kentaishiguro@sslab.ics.keio.ac.jp (mailing list archive)
Series: Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> seems to be effective for (2) while it only uses the IPI receiver
> information.

On the IPI side of things, I like the idea of explicitly tracking the IPIs,
especially if we can simplify the implementation, e.g. by losing the receiver
info and making ipi_received a bool. Maybe temporarily table Wanpeng's patch
while this approach is analyzed?
On Thu, 22 Apr 2021 at 06:13, Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Dear KVM developers and maintainers,
>
> In our research work presented last week at the VEE 2021 conference [1], we
> found out that a lot of continuous Pause-Loop-Exiting (PLE) events occur
> due to three problems we have identified: 1) Linux CFS ignores hints from
> KVM; 2) IPI receiver vCPUs in user mode are not boosted; 3) an IPI receiver
> that has halted is always a candidate for boost. We have introduced two
> mitigations against these problems.
>
> To solve problem (1), patch 1 increases the vruntime of the yielded vCPU to
> pass the check `if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next,
> left) < 1)` in `struct sched_entity * pick_next_entity()` if the cfs_rq's
> skip and next are both vCPUs in the same VM. To keep fairness it does not
> prioritize the guest VM which causes PLE; however, it improves performance
> by eliminating unnecessary PLE. Also, we have confirmed
> `yield_to_task_fair` is called only from KVM.
>
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> seems to be effective for (2) while it only uses the IPI receiver
> information.
>
> Our approach reduces the total number of PLE events by up to 87.6% in four
> 8-vCPU VMs in an over-subscribed scenario with the Linux kernel 5.6.0.
> Please find the patch below.

You should mention that this improvement mainly comes from the scheduler
hacking for your problem (1); however, a KVM task is just an ordinary task,
and the scheduler maintainers do not always accept special treatment. The
worst case of problem (1) mentioned in your paper is, I guess, the vCPU
stacking issue; I tried to mitigate it before
(https://lore.kernel.org/kvm/1564479235-25074-1-git-send-email-wanpengli@tencent.com/).

For your problem (3), we evaluated hackbench, which heavily contends rq locks
and generates heavy async IPIs (reschedule IPIs); the async IPI influence is
around 0.X%, so I don't expect normal workloads to feel any effect. In
addition, four 8-vCPU VMs are not suitable for a scalability evaluation. I
don't think the complexity introduced by your patch 2 is worth it, since it
gets a similar effect to my version with the current heuristic algorithm.

    Wanpeng
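As context for patch 1 (which is not reproduced in this excerpt), a minimal
sketch of the described vruntime adjustment could look roughly like the
following; entities_share_vm() is a hypothetical helper, and the actual
increment used by the real patch may differ:

/* Sketch only, not the posted patch: when a vCPU yields via
 * yield_to_task_fair(), and both cfs_rq->skip (the yielder) and
 * cfs_rq->next (the yield target) back vCPUs of the same VM, raise
 * the yielder's vruntime so that the
 * "wakeup_preempt_entity(cfs_rq->next, left) < 1" test in
 * pick_next_entity() succeeds and the target runs next.
 */
static void bump_yielding_vcpu(struct cfs_rq *cfs_rq)
{
	struct sched_entity *skip = cfs_rq->skip;
	struct sched_entity *next = cfs_rq->next;

	if (!skip || !next)
		return;

	/* Hypothetical helper: true if both entities are vCPU threads
	 * of one VM; cross-VM yields are left alone to keep fairness
	 * between VMs.
	 */
	if (!entities_share_vm(skip, next))
		return;

	/* Push the yielder far enough behind the target that the
	 * target wins the wakeup-preemption comparison.
	 */
	skip->vruntime = max_vruntime(skip->vruntime,
			next->vruntime + sysctl_sched_wakeup_granularity);
}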
On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > vCPUs and leverages the relationship between vCPUs to select boost
> > candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> > delivering interrupt" patch
> > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > seems to be effective for (2) while it only uses the IPI receiver
> > information.
>
> On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> especially if we can simplify the implementation, e.g. by losing the receiver
> info and making ipi_received a bool. Maybe temporarily table Wanpeng's patch
> while this approach is analyzed?

Hi all,

I evaluated my patch
(https://lore.kernel.org/kvm/1618542490-14756-1-git-send-email-wanpengli@tencent.com),
Kenta's patch 2, and Sean's suggestion. The testing environment is pbzip2 in
a 96-vCPU VM in an over-subscribed scenario (the host machine is a 2-socket,
48-core, 96-HT Intel CLX box). Note: Kenta's scheduler hacking is not
applied. The score of my patch is the most stable and gives the best
performance.

Wanpeng's patch

The average: vanilla -> boost: 69.124 -> 61.975, 10.3%

* Wall Clock: 61.695359 seconds
* Wall Clock: 63.343579 seconds
* Wall Clock: 61.567513 seconds
* Wall Clock: 62.144722 seconds
* Wall Clock: 61.091442 seconds
* Wall Clock: 62.085912 seconds
* Wall Clock: 61.311954 seconds

Kenta's patch

The average: vanilla -> boost: 69.148 -> 64.567, 6.6%

* Wall Clock: 66.288113 seconds
* Wall Clock: 61.228642 seconds
* Wall Clock: 62.100524 seconds
* Wall Clock: 68.355473 seconds
* Wall Clock: 64.864608 seconds

Sean's suggestion:

The average: vanilla -> boost: 69.148 -> 66.505, 3.8%

* Wall Clock: 60.583562 seconds
* Wall Clock: 58.533960 seconds
* Wall Clock: 70.103489 seconds
* Wall Clock: 74.279028 seconds
* Wall Clock: 69.024194 seconds

I followed (almost) Sean's suggestion:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0050f39..78b5eb6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1272,6 +1272,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
 void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 {
 	struct kvm_lapic_irq irq;
+	struct kvm_vcpu *dest_vcpu;
 
 	irq.vector = icr_low & APIC_VECTOR_MASK;
 	irq.delivery_mode = icr_low & APIC_MODE_MASK;
@@ -1285,6 +1286,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 	else
 		irq.dest_id = GET_APIC_DEST_FIELD(icr_high);
 
+	dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq.dest_id);
+	if (dest_vcpu)
+		WRITE_ONCE(dest_vcpu->ipi_received, true);
+
 	trace_kvm_apic_ipi(icr_low, irq.dest_id);
 
 	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 303fb55..a98bf571 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9298,6 +9298,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	WRITE_ONCE(vcpu->ipi_received, false);
+
 	if (unlikely(vcpu->arch.switch_db_regs)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5ef09a4..81e39fa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,8 @@ struct kvm_vcpu {
 		bool dy_eligible;
 	} spin_loop;
 #endif
+
+	bool ipi_received;
 	bool preempted;
 	bool ready;
 	struct kvm_vcpu_arch arch;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c682f82..5098929 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -411,6 +411,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
+	vcpu->ipi_received = false;
 	vcpu->preempted = false;
 	vcpu->ready = false;
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
@@ -3220,6 +3221,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 		    !vcpu_dy_runnable(vcpu))
 			continue;
 		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+		    !READ_ONCE(vcpu->ipi_received) &&
 		    !kvm_arch_vcpu_in_kernel(vcpu))
 			continue;
 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
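Restated outside the diff for readability, the net boost-eligibility rule in
kvm_vcpu_on_spin() after this change is the following (commentary only, not
part of the patch):

/* A preempted vCPU that was running in user mode is skipped as a
 * directed-yield target unless the new flag says it has an IPI
 * pending; kernel-mode vCPUs are considered as before.
 */
static bool skip_as_yield_target(struct kvm_vcpu *vcpu,
				 bool yield_to_kernel_mode)
{
	return READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
	       !READ_ONCE(vcpu->ipi_received) &&
	       !kvm_arch_vcpu_in_kernel(vcpu);
}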
On Thu, Apr 22, 2021, Wanpeng Li wrote:
> On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > > vCPUs and leverages the relationship between vCPUs to select boost
> > > candidates. The "[PATCH] KVM: Boost vCPU candidate in user mode which is
> > > delivering interrupt" patch
> > > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > > seems to be effective for (2) while it only uses the IPI receiver
> > > information.
> >
> > On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> > especially if we can simplify the implementation, e.g. by losing the receiver
> > info and making ipi_received a bool. Maybe temporarily table Wanpeng's patch
> > while this approach is analyzed?
>
> Hi all,
>
> I evaluated my patch

Thanks for doing the testing, much appreciated!

> (https://lore.kernel.org/kvm/1618542490-14756-1-git-send-email-wanpengli@tencent.com),
> Kenta's patch 2, and Sean's suggestion. The testing environment is pbzip2 in
> a 96-vCPU VM in an over-subscribed scenario (the host machine is a 2-socket,
> 48-core, 96-HT Intel CLX box).

Are the vCPUs affined in any way? How many VMs are running? Are there other
workloads on the host? Not criticising, just asking so that others can
reproduce your setup.

> Note: Kenta's scheduler hacking is not applied. The score of my patch is
> the most stable and gives the best performance.

On the other hand, Kenta's approach has the advantage of working for both
Intel and AMD. But I'm also not very familiar with AMD's AVIC, so I don't
know if it's feasible to implement a performant equivalent in
svm_dy_apicv_has_pending_interrupt().

Kenta's patch is also flawed as it doesn't scale to 96 vCPUs; vCPUs 64-95
will never get boosted.

> Wanpeng's patch
>
> The average: vanilla -> boost: 69.124 -> 61.975, 10.3%
>
> * Wall Clock: 61.695359 seconds
> * Wall Clock: 63.343579 seconds
> * Wall Clock: 61.567513 seconds
> * Wall Clock: 62.144722 seconds
> * Wall Clock: 61.091442 seconds
> * Wall Clock: 62.085912 seconds
> * Wall Clock: 61.311954 seconds
>
> Kenta's patch
>
> The average: vanilla -> boost: 69.148 -> 64.567, 6.6%
>
> * Wall Clock: 66.288113 seconds
> * Wall Clock: 61.228642 seconds
> * Wall Clock: 62.100524 seconds
> * Wall Clock: 68.355473 seconds
> * Wall Clock: 64.864608 seconds
>
> Sean's suggestion:
>
> The average: vanilla -> boost: 69.148 -> 66.505, 3.8%
>
> * Wall Clock: 60.583562 seconds
> * Wall Clock: 58.533960 seconds
> * Wall Clock: 70.103489 seconds
> * Wall Clock: 74.279028 seconds
> * Wall Clock: 69.024194 seconds
On Mon, 26 Apr 2021 at 10:56, Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Dear all,
>
> Thank you for the insightful feedback.
>
> Does Sean's suggested version of Wanpeng's patch mark a running vCPU as an
> IPI receiver? If so, I think the candidate set of vCPUs for boost is
> slightly different between using kvm_arch_interrupt_delivery and using the
> boolean ipi_received. In the version using the boolean ipi_received, vCPUs
> which receive an IPI while running are also candidates for a boost.
> However, they have likely already responded to their IPI before they exit.

There is a vcpu->preempted check here:

        if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+           !READ_ONCE(vcpu->ipi_received) &&

    Wanpeng
Thank you for the reply.

My question is about the following scenario:
1. a running vCPU receives an IPI and the vCPU's ipi_received becomes true
2. the vCPU responds to the IPI
3. the vCPU exits
4. the vCPU is preempted by KVM
5. the vCPU is boosted, but it has already responded to the IPI
6. the vCPU re-enters the guest and the vCPU's ipi_received is cleared

In this case, I think the check of vcpu->preempted does not limit the
candidate vCPUs.
On Mon, 26 Apr 2021 at 11:19, Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Thank you for the reply.
>
> My question is about the following scenario:
> 1. a running vCPU receives an IPI and the vCPU's ipi_received becomes true
> 2. the vCPU responds to the IPI
> 3. the vCPU exits
> 4. the vCPU is preempted by KVM
> 5. the vCPU is boosted, but it has already responded to the IPI
> 6. the vCPU re-enters the guest and the vCPU's ipi_received is cleared
>
> In this case, I think the check of vcpu->preempted does not limit the
> candidate vCPUs.

Good point, you are right. However, I actually played with that code a bit
before; I have another version that adds the vcpu->preempted check when
marking the IPI receiver, and the score is not as good as expected.

    Wanpeng
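A minimal sketch of the unposted variant Wanpeng mentions (adding the
vcpu->preempted check when marking the receiver) might look like this in
kvm_apic_send_ipi(); this is an assumption based on his description, not his
actual code:

	/* Sketch of the described variant: only flag the destination
	 * as an IPI receiver if it is already preempted, so a vCPU
	 * that handles the IPI while running is never left marked as
	 * a stale boost candidate.
	 */
	dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq.dest_id);
	if (dest_vcpu && READ_ONCE(dest_vcpu->preempted))
		WRITE_ONCE(dest_vcpu->ipi_received, true);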
I see. Thank you!