[v2,11/43] KVM: Don't block+unblock when halt-polling is successful

Message ID 20211009021236.4122790-12-seanjc@google.com (mailing list archive)
State Not Applicable
Series KVM: Halt-polling and x86 APICv overhaul

Commit Message

Sean Christopherson Oct. 9, 2021, 2:12 a.m. UTC
Invoke the arch hooks for block+unblock if and only if KVM actually
attempts to block the vCPU.  The only non-nop implementation is on x86,
specifically SVM's AVIC, and there is no need to put the AVIC prior to
halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR
to find pending IRQs regardless of whether the AVIC is loaded/"running".

The primary motivation is to allow future cleanup to split out "block"
from "halt", but this is also likely a small performance boost on x86 SVM
when halt-polling is successful.

Adjust the post-block path to update "cur" after unblocking, i.e. include
AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
is consistent.  Moving just the pre-block arch hook would result in only
the AVIC put latency being included in the halt_wait stats.  There is no
obvious evidence that one way or the other is correct, so just ensure KVM
is consistent.

Note, x86 has two separate paths for handling APICv with respect to vCPU
blocking.  VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
hooks in kvm_vcpu_block().  Prior to this patch, the two paths were more
or less functionally identical.  That is very much not the case after
this patch, as the hooks used by VMX _must_ fire before halt-polling.
x86's entire mess will be cleaned up in future patches.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
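
The reordering described in the commit message can be sketched in plain C. This is a minimal userspace model, not the kernel code: `struct toy_vcpu`, `toy_vcpu_block()` and the `toy_arch_*` hooks are hypothetical stand-ins, and the poll loop counts iterations instead of comparing ktime values. It only shows that the arch hooks run if and only if the vCPU actually attempts to block.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_vcpu; tracks only what the sketch needs. */
struct toy_vcpu {
	int  polls_until_event; /* event arrives after this many polls, <0 = never */
	int  poll_budget;       /* max poll iterations (stands in for halt_poll_ns) */
	int  blocking_calls;    /* times the "blocking" arch hook ran */
	int  unblocking_calls;  /* times the "unblocking" arch hook ran */
	bool waited;            /* true once the vCPU went past polling */
};

static void toy_arch_vcpu_blocking(struct toy_vcpu *v)   { v->blocking_calls++; }
static void toy_arch_vcpu_unblocking(struct toy_vcpu *v) { v->unblocking_calls++; }

static void toy_vcpu_block(struct toy_vcpu *v)
{
	assert(v->poll_budget >= 0);

	/* Halt-poll first; no arch hooks are involved on this path. */
	for (int i = 0; i < v->poll_budget; i++) {
		if (v->polls_until_event >= 0 && i >= v->polls_until_event)
			return; /* successful poll, corresponds to "goto out" */
	}

	/* Polling failed: only now tell the arch code the vCPU is blocking. */
	toy_arch_vcpu_blocking(v);
	v->waited = true; /* stands in for the rcuwait + schedule() loop */
	toy_arch_vcpu_unblocking(v);
}
```

With a successful poll neither hook fires; when the poll budget runs out, each hook fires exactly once around the (modelled) wait.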

Comments

Maxim Levitsky Oct. 27, 2021, 1:40 p.m. UTC | #1
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> Invoke the arch hooks for block+unblock if and only if KVM actually
> attempts to block the vCPU.  The only non-nop implementation is on x86,
> specifically SVM's AVIC, and there is no need to put the AVIC prior to
> halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR
> to find pending IRQs regardless of whether the AVIC is loaded/"running".
> 
> The primary motivation is to allow future cleanup to split out "block"
> from "halt", but this is also likely a small performance boost on x86 SVM
> when halt-polling is successful.
> 
> Adjust the post-block path to update "cur" after unblocking, i.e. include
> AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
> is consistent.  Moving just the pre-block arch hook would result in only
> the AVIC put latency being included in the halt_wait stats.  There is no
> obvious evidence that one way or the other is correct, so just ensure KVM
> is consistent.
> 
> Note, x86 has two separate paths for handling APICv with respect to vCPU
> blocking.  VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
> hooks in kvm_vcpu_block().  Prior to this patch, the two paths were more
> or less functionally identical.  That is very much not the case after
> this patch, as the hooks used by VMX _must_ fire before halt-polling.
> x86's entire mess will be cleaned up in future patches.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/kvm_main.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f90b3ed05628..227f6bbe0716 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	bool waited = false;
>  	u64 block_ns;
>  
> -	kvm_arch_vcpu_blocking(vcpu);
> -
>  	start = cur = poll_end = ktime_get();
>  	if (do_halt_poll) {
>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> @@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  		} while (kvm_vcpu_can_poll(cur, stop));
>  	}
>  
> +	kvm_arch_vcpu_blocking(vcpu);
>  
>  	prepare_to_rcuwait(wait);
>  	for (;;) {
> @@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  		schedule();
>  	}
>  	finish_rcuwait(wait);
> +
> +	kvm_arch_vcpu_unblocking(vcpu);
> +
>  	cur = ktime_get();
>  	if (waited) {
>  		vcpu->stat.generic.halt_wait_ns +=
> @@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  				ktime_to_ns(cur) - ktime_to_ns(poll_end));
>  	}
>  out:
> -	kvm_arch_vcpu_unblocking(vcpu);
>  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>  
>  	/*

Makes sense.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
Maxim Levitsky Nov. 28, 2021, 10:16 p.m. UTC | #2
On Wed, 2021-10-27 at 16:40 +0300, Maxim Levitsky wrote:
> On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote:
> > Invoke the arch hooks for block+unblock if and only if KVM actually
> > attempts to block the vCPU.  The only non-nop implementation is on x86,
> > specifically SVM's AVIC, and there is no need to put the AVIC prior to
> > halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR
> > to find pending IRQs regardless of whether the AVIC is loaded/"running".
> > 
> > The primary motivation is to allow future cleanup to split out "block"
> > from "halt", but this is also likely a small performance boost on x86 SVM
> > when halt-polling is successful.
> > 
> > Adjust the post-block path to update "cur" after unblocking, i.e. include
> > AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior
> > is consistent.  Moving just the pre-block arch hook would result in only
> > the AVIC put latency being included in the halt_wait stats.  There is no
> > obvious evidence that one way or the other is correct, so just ensure KVM
> > is consistent.
> > 
> > Note, x86 has two separate paths for handling APICv with respect to vCPU
> > blocking.  VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
> > hooks in kvm_vcpu_block().  Prior to this patch, the two paths were more
> > or less functionally identical.  That is very much not the case after
> > this patch, as the hooks used by VMX _must_ fire before halt-polling.
> > x86's entire mess will be cleaned up in future patches.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  virt/kvm/kvm_main.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index f90b3ed05628..227f6bbe0716 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  	bool waited = false;
> >  	u64 block_ns;
> >  
> > -	kvm_arch_vcpu_blocking(vcpu);
> > -
> >  	start = cur = poll_end = ktime_get();
> >  	if (do_halt_poll) {
> >  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> > @@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  		} while (kvm_vcpu_can_poll(cur, stop));
> >  	}
> >  
> > +	kvm_arch_vcpu_blocking(vcpu);
> >  
> >  	prepare_to_rcuwait(wait);
> >  	for (;;) {
> > @@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  		schedule();
> >  	}
> >  	finish_rcuwait(wait);
> > +
> > +	kvm_arch_vcpu_unblocking(vcpu);
> > +
> >  	cur = ktime_get();
> >  	if (waited) {
> >  		vcpu->stat.generic.halt_wait_ns +=
> > @@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  				ktime_to_ns(cur) - ktime_to_ns(poll_end));
> >  	}
> >  out:
> > -	kvm_arch_vcpu_unblocking(vcpu);
> >  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
> >  
> >  	/*
> 
> Makes sense.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Best regards,
> 	Maxim Levitsky


So...

Last week I decided to study a bit how AVIC behaves when vCPUs are not 100% running
(aka no cpu_pm=on), mostly to understand the so-called 'GA log' thing.

(The idea is that when you tell the IOMMU that a vCPU is not running,
the IOMMU starts logging all incoming passed-through interrupts to a ring buffer,
and raises its own interrupt, whose handler is supposed to wake up the VM's vCPU.)
 
That led me to discover that AMD's IOMMU is totally busted after a suspend/resume cycle,
which took me a few days to fix (and most of the time I worried that it was some sort of BIOS bug which nobody would fix,
as the IOMMU interrupt delivery was totally busted after resume, and sometimes even a power cycle didn't help
to revive it - phew...).
Luckily I did fix it, and the patches are waiting for review upstream.
(https://www.spinics.net/lists/kernel/msg4153488.html)
 
 
Another thing I discovered is that this patch series totally breaks my VMs without cpu_pm=on.
The whole series (I haven't bisected it yet) makes even my fedora32 VM very laggy, almost unusable
(and it only has one passed-through device, a NIC).

If I apply only the part of the series up to this patch, though, my fedora VM seems to work fine, but
my windows VM still locks up hard when I run 'LatencyTop' in it, which doesn't happen without this patch.
 
 
So far the symptoms I see are that on vCPU 0, the ISR has a quite high interrupt (0xe1 the last time I saw it),
TPR and PPR are 0xe0 (although I have seen TPR have different values), and the IRR has plenty of interrupts
with lower priority. The VM seems to be stuck in this case, as if its EOI got lost or something is preventing
the IRQ handler from issuing EOI.
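
To make the register values above concrete, here is a small sketch of the architectural local APIC priority rule: the processor priority (PPR) is the higher of the TPR and the priority class of the highest in-service vector, and a pending IRR vector is only delivered if its class is strictly above the PPR class. The helper names are made up for illustration and are not KVM functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The priority class is the upper nibble of a vector or of TPR/PPR. */
static uint8_t prio_class(uint8_t v)
{
	return v & 0xf0;
}

/* highest_isr is the highest in-service vector, or -1 if ISR is empty. */
static uint8_t compute_ppr(uint8_t tpr, int highest_isr)
{
	assert(highest_isr >= -1 && highest_isr <= 0xff);

	uint8_t isrv = highest_isr < 0 ? 0 : (uint8_t)highest_isr;

	if (prio_class(tpr) >= prio_class(isrv))
		return tpr;
	return prio_class(isrv);
}

/* A pending vector is deliverable only if its class beats the PPR class. */
static bool irr_vector_deliverable(uint8_t vector, uint8_t ppr)
{
	return prio_class(vector) > prio_class(ppr);
}
```

With vector 0xe1 in service, the PPR is 0xe0 for any TPR at or below that class, so every pending vector below 0xf0 stays parked in the IRR until the guest issues the EOI for 0xe1. That matches the stuck state described above.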
 
LatencyTop does install some form of a kernel driver which likely does meddle with interrupts (maybe it sends lots of self IPIs?).
 
100% reproducible as soon as I start monitoring with LatencyTop.
 
 
Without this patch (or with halt polling disabled) it works,

but I still managed to lock up the VM a few times after a lot of random clicking/starting up various apps while LatencyTop was running,
etc. In that case, though, when I dump the local APIC via qemu's hmp interface the VM instantly revives, which might be either the same bug
amplified by this patch or something else.
That was tested on a pure 5.15.0 kernel without any patches.
 
It is possible that this is a bug in LatencyTop that just got exposed by different timing.
 
The windows VM does have a GPU and a few USB controllers passed to it, and without them, in pure VM mode, as I call it,
LatencyTop seems to work.
 

Tomorrow I'll give it a more formal investigation.
 
Best regards,
	Maxim Levitsky
Sean Christopherson Nov. 29, 2021, 5:25 p.m. UTC | #3
On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> (This thing is that when you tell the IOMMU that a vCPU is not running,
> Another thing I discovered that this patch series totally breaks my VMs,
> without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> fedora32 VM be very laggy, almost unusable, and it only has one
> passed-through device, a nic).

Grrrr, the complete lack of comments in the KVM code and the separate paths for
VMX vs SVM when handling HLT with APICv make this all way more difficult to
understand than it should be.

The hangs are likely due to:

  KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)

If a posted interrupt arrives after KVM has done its final search through the vIRR,
but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.

I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
notification after switching to the wakeup vector.

For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
would have to rewalk the vIRR), no easy way to signal the current CPU to do a wakeup (I
don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
no simplification of load/put code.

If the scheduler were changed to support waking in the sched_out path, then I'd be
more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
time, but for now it's not worth it.

> If I apply though only the patch series up to this patch, my fedora VM seems
> to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> in it, which doesn't happen without this patch.

By "run 'LatencyTop' in it", do you mean running something in the Windows guest?
The only search results I can find for LatencyTop are Linux specific.

> So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt
> (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to
> have different values), and IRR has plenty of interrupts with lower priority.
> The VM seems to be stuck in this case. As if its EOI got lost or something is
> preventing the IRQ handler from issuing EOI.
>  
> LatencyTop does install some form of a kernel driver which likely does meddle
> with interrupts (maybe it sends lots of self IPIs?).
>  
> 100% reproducible as soon as I start monitoring with LatencyTop.
>  
> Without this patch it works (or if disabling halt polling),

Huh.  I assume everything works if you disable halt polling _without_ this patch
applied?

If so, that implies that successful halt polling without mucking with vCPU IOMMU
affinity is somehow problematic.  I can't think of any relevant side effects other
than timing.
Paolo Bonzini Nov. 29, 2021, 5:53 p.m. UTC | #4
On 11/29/21 18:25, Sean Christopherson wrote:
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> 
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.

BTW Maxim reported that it can break even without assigned devices.

> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.

I agree that the hooks cannot be dropped but the bug is reproducible 
with this patch, where the hooks are still there.

With the hooks in place, you have:

	kvm_arch_vcpu_blocking(vcpu)
	  avic_set_running(vcpu, false)
	    avic_vcpu_put(vcpu)
	      avic_update_iommu_vcpu_affinity()
	      WRITE_ONCE(...) // clear IS_RUNNING bit

	set_current_state()
	  smp_mb()

	kvm_vcpu_check_block()
	  return kvm_arch_vcpu_runnable() || ...
	    return kvm_vcpu_has_events() || ...
	      return kvm_cpu_has_interrupt() || ...
		return kvm_apic_has_interrupt() || ...
		  return apic_has_interrupt_for_ppr()
		    apic_find_highest_irr()
		      scan vIRR

This covers the barrier between the write of is_running and the read of 
vIRR, and the other side should be correct as well.  In particular, 
reads of is_running always come after an atomic write to vIRR, and hence 
after an implicit full memory barrier.  svm_deliver_avic_intr() has an 
smp_mb__after_atomic() after writing IRR; avic_kick_target_vcpus() even 
has an explicit barrier in srcu_read_lock(), between the microcode's 
write to vIRR and its own call to avic_vcpu_is_running().
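
The ordering argument above is the classic store-buffering pattern: each side writes its own flag, executes a full barrier, then reads the other side's flag, so at least one of them must observe the other's write. A minimal C11 sketch follows, assuming made-up names (`struct toy_avic`, `toy_waiter_may_sleep()`, `toy_sender_needs_kick()`) and using sequentially consistent atomics throughout, which is stronger than what the kernel code uses but gives the same guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: one IsRunning flag and a 32-bit slice of the vIRR. */
struct toy_avic {
	atomic_bool is_running;
	atomic_uint virr;
};

/*
 * Blocking side: clear is_running, then scan vIRR. The seq_cst store and
 * load play the role of WRITE_ONCE() + smp_mb() + the vIRR scan.
 * Returns true if nothing is pending and it is safe to sleep.
 */
static bool toy_waiter_may_sleep(struct toy_avic *a)
{
	atomic_store(&a->is_running, false);
	return atomic_load(&a->virr) == 0;
}

/*
 * Sending side: atomically set the vIRR bit, then read is_running.
 * Returns true if the target is not running and needs a wakeup kick.
 */
static bool toy_sender_needs_kick(struct toy_avic *a, unsigned int vector_bit)
{
	assert(vector_bit < 32);
	atomic_fetch_or(&a->virr, 1u << vector_bit);
	return !atomic_load(&a->is_running);
}
```

Under this model the outcome "waiter saw an empty vIRR and sender saw is_running == true" is impossible, whatever the interleaving. A lost wakeup therefore means one of the two sides is not following this protocol.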

Still it does seem to be a race that happens when IS_RUNNING=true but 
vcpu->mode == OUTSIDE_GUEST_MODE.  This patch makes the race easier to 
trigger because it moves IS_RUNNING=false later.

Paolo
Paolo Bonzini Nov. 29, 2021, 5:55 p.m. UTC | #5
On 11/29/21 18:25, Sean Christopherson wrote:
>> If I apply though only the patch series up to this patch, my fedora VM seems
>> to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
>> in it, which doesn't happen without this patch.
> 
> By "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> The only search results I can find for LatencyTop are Linux specific.

I think it's LatencyMon, https://www.resplendence.com/latencymon.

Paolo
Sean Christopherson Nov. 29, 2021, 6:55 p.m. UTC | #6
On Mon, Nov 29, 2021, Paolo Bonzini wrote:
> On 11/29/21 18:25, Sean Christopherson wrote:
> > If a posted interrupt arrives after KVM has done its final search through the vIRR,
> > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> > 
> > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> > notification after switching to the wakeup vector.
> 
> BTW Maxim reported that it can break even without assigned devices.
> 
> > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> 
> I agree that the hooks cannot be dropped but the bug is reproducible with
> this patch, where the hooks are still there.

...

> Still it does seem to be a race that happens when IS_RUNNING=true but
> vcpu->mode == OUTSIDE_GUEST_MODE.  This patch makes the race easier to
> trigger because it moves IS_RUNNING=false later.

Oh!  Any chance the bug only repros with preemption enabled?  That would explain
why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.

svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
passes in vcpu->cpu.  If the vCPU is preempted and scheduled in on a different CPU,
avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
Paolo Bonzini Nov. 29, 2021, 7:18 p.m. UTC | #7
On 11/29/21 19:55, Sean Christopherson wrote:
>> Still it does seem to be a race that happens when IS_RUNNING=true but
>> vcpu->mode == OUTSIDE_GUEST_MODE.  This patch makes the race easier to
>> trigger because it moves IS_RUNNING=false later.
> 
> Oh!  Any chance the bug only repros with preemption enabled?  That would explain
> why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.

Me too.

> svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> passes in vcpu->cpu.  If the vCPU is preempted and scheduled in on a different CPU,
> avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.

That would make a lot of sense.  avic_vcpu_load() can handle 
svm->avic_is_running = false, but avic_set_running still needs its body 
wrapped by preempt_disable/preempt_enable.
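
The shape of that suggestion can be sketched as a userspace mock. Everything prefixed `toy_` is a hypothetical stand-in (the real kernel primitives are preempt_disable()/preempt_enable(), and the real functions take a struct kvm_vcpu); the point is only that the CPU number is sampled and written into the physical ID entry inside one non-preemptible region, so the task cannot migrate in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock scheduler state; in the kernel this is per-CPU/per-task state. */
static int toy_preempt_count;
static int toy_current_cpu;

static void toy_preempt_disable(void) { toy_preempt_count++; }
static void toy_preempt_enable(void)
{
	assert(toy_preempt_count > 0);
	toy_preempt_count--;
}

/* Hypothetical stand-in for the vCPU and its physical APIC ID table entry. */
struct toy_vcpu {
	bool avic_is_running;
	int  entry_host_cpu;   /* host CPU recorded in the table entry */
	bool entry_is_running; /* IsRunning bit in the table entry */
};

static void toy_avic_vcpu_load(struct toy_vcpu *v, int cpu)
{
	v->entry_host_cpu = cpu;
	v->entry_is_running = v->avic_is_running;
}

static void toy_avic_vcpu_put(struct toy_vcpu *v)
{
	v->entry_is_running = false;
}

static void toy_avic_set_running(struct toy_vcpu *v, bool is_running)
{
	toy_preempt_disable();

	/* With preemption off, the CPU sampled here cannot go stale. */
	int cpu = toy_current_cpu;

	v->avic_is_running = is_running;
	if (is_running)
		toy_avic_vcpu_load(v, cpu);
	else
		toy_avic_vcpu_put(v);

	toy_preempt_enable();
}
```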

Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his 
own build so it would not surprise me if he used CONFIG_PREEMPT=y.

Paolo
Maxim Levitsky Nov. 29, 2021, 10:53 p.m. UTC | #8
On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> On 11/29/21 19:55, Sean Christopherson wrote:
> > > Still it does seem to be a race that happens when IS_RUNNING=true but
> > > vcpu->mode == OUTSIDE_GUEST_MODE.  This patch makes the race easier to
> > > trigger because it moves IS_RUNNING=false later.
> > 
> > Oh!  Any chance the bug only repros with preemption enabled?  That would explain
> > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
> 
> Me too.
> 
> > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> > passes in vcpu->cpu.  If the vCPU is preempted and scheduled in on a different CPU,
> > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
> 
> That would make a lot of sense.  avic_vcpu_load() can handle 
> svm->avic_is_running = false, but avic_set_running still needs its body 
> wrapped by preempt_disable/preempt_enable.
> 
> Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his 
> own build so it would not surprise me if he used CONFIG_PREEMPT=y.
> 
> Paolo
> 

I will write all the details tomorrow but I strongly suspect CPU erratum #1235:
https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Basically what I see is that
 
1. vCPU2 disables is_running in avic physical id cache
2. vCPU2 checks that IRR is empty and it is
3. vCPU2 does schedule();
 
and it keeps on sleeping forever. If I kick it via a signal
(e.g. by running the 'info registers' qemu hmp command
or doing stop/cont on the same hmp interface), the
vCPU wakes up and notices that IRR suddenly is not empty,
and the VM comes back to life (and then hangs again after a while
with the same problem....).
 
As far as I can see in the traces, the bit in IRR came from
another vCPU which didn't respect the is_running bit and didn't get an
AVIC_INCOMPLETE_IPI VMexit.
I can't 100% prove it yet, but everything in the trace shows this.
 
About the rest of the environment, currently I reproduce this in
a VM which has no pci passed through devices at all, just AVIC.
(I wasn't able to reproduce it before just because I forgot to
enable AVIC in this configuration).
 
So I also agree that Sean's patch is not to blame here,
it just made the window between setting is_running and getting to sleep
shorter and made it less likely that other vCPUs will pick up the is_running change.
(I suspect that they pick it up on next vmrun, and otherwise the value is somehow
cached wrongfully in them).
 
A very performance-killing workaround of kicking all vCPUs when one of them enters vcpu_block
does seem to work for me, but it throws all the timing off so I can't prove it.
 
That is all, I will write more detailed info, including some traces I have.
 
I do use Windows 10 with the so-called LatencyMon in it, which shows overall how
much latency hardware interrupts have. It used to be useful for me to
ensure that my VMs are suitable for RT-like latency (even before I joined Red Hat,
I tuned my VMs as much as I could to make my Rift CV1 VR headset work well, which
needs RT-like latencies).

These days VR works fine in my VMs anyway, but I still keep this tool around to keep an eye on it.
 
I really need to write a kvm unit test to stress test IPIs, especially this case,
I will do this very soon.
 
 
Wei Huang, any info on this would be very helpful. 
 
Maybe putting the avic physical table in UC memory would help? 
Maybe ringing doorbells of all other vcpus will help them notice the change?

Best regards,
	Maxim Levitsky
Maxim Levitsky Nov. 29, 2021, 10:55 p.m. UTC | #9
On Mon, 2021-11-29 at 18:55 +0100, Paolo Bonzini wrote:
> On 11/29/21 18:25, Sean Christopherson wrote:
> > > If I apply though only the patch series up to this patch, my fedora VM seems
> > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > > in it, which doesn't happen without this patch.
> > 
> > By "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> > The only search results I can find for LatencyTop are Linux specific.
> 
> I think it's LatencyMon, https://www.resplendence.com/latencymon.
> 
> Paolo
> 
Yes.

Best regards,
	Maxim Levitsky
Maxim Levitsky Dec. 2, 2021, 12:20 a.m. UTC | #10
On Tue, 2021-11-30 at 00:53 +0200, Maxim Levitsky wrote:
> On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> > On 11/29/21 19:55, Sean Christopherson wrote:
> > > > Still it does seem to be a race that happens when IS_RUNNING=true but
> > > > vcpu->mode == OUTSIDE_GUEST_MODE.  This patch makes the race easier to
> > > > trigger because it moves IS_RUNNING=false later.
> > > 
> > > Oh!  Any chance the bug only repros with preemption enabled?  That would explain
> > > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
> > 
> > Me too.
> > 
> > > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> > > passes in vcpu->cpu.  If the vCPU is preempted and scheduled in on a different CPU,
> > > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
> > 
> > That would make a lot of sense.  avic_vcpu_load() can handle 
> > svm->avic_is_running = false, but avic_set_running still needs its body 
> > wrapped by preempt_disable/preempt_enable.
> > 
> > Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his 
> > own build so it would not surprise me if he used CONFIG_PREEMPT=y.
> > 
> > Paolo
> > 
> 
> I will write all the details tomorrow but I strongly suspect the CPU errata 
> https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> #1235
>  
> Basically what I see that
>  
> 1. vCPU2 disables is_running in avic physical id cache
> 2. vCPU2 checks that IRR is empty and it is
> 3. vCPU2 does schedule();
>  
> and it keeps on sleeping forever. If I kick it via signal 
> (like just doing 'info registers' qemu hmp command
> or just stop/cont on the same hmp interface, the
> vCPU wakes up and notices that IRR suddenly is not empty,
> and the VM comes back to life (and then hangs after a while again
> with the same problem....).
>  
> As far as I see in the traces, the bit in IRR came from
> another VCPU who didn't respect the is_running bit and didn't get 
> AVIC_INCOMPLETE_IPI VMexit.
> I can't 100% prove it yet, but everything in the trace shows this.
>  
> About the rest of the environment, currently I reproduce this in
> a VM which has no pci passed through devices at all, just AVIC.
> (I wasn't able to reproduce it before just because I forgot to
> enable AVIC in this configuration).
>  
> So I also agree that Sean's patch is not to blame here,
> it just made the window between setting is_running and getting to sleep
> shorter and made it less likely that other vCPUs will pick up the is_running change.
> (I suspect that they pick it up on next vmrun, and otherwise the value is somehow
> cached wrongfully in them).
>  
> A very performance killing workaround of kicking all vCPUs when one of them enters vcpu_block
> does seem to work for me but it skews all the timing off so I can't prove it.
>  
> That is all, I will write more detailed info, including some traces I have.
>  
> I do use windows 10 with so called LatencyMon in it, which shows overall how
> much latency hardware interrupts have, which used to be useful for me to
> ensure that my VMs are suitable for RT like latency (even before I joined RedHat,
> I tuned my VMs as much as I could to make my Rift CV1 VR headset work well which 
> needs RT like latencies.
>  
> These days VR works fine in my VMs anyway, but I still kept this tool to keep an eye on it).
>  
> I really need to write a kvm unit test to stress test IPIs, especially this case,
> I will do this very soon.
>  
>  
> Wei Huang, any info on this would be very helpful. 
>  
> Maybe putting the avic physical table in UC memory would help? 
> Maybe ringing doorbells of all other vcpus will help them notice the change?
> 
> Best regards,
> 	Maxim Levitsky


Hi!

I am now almost sure that this is erratum #1235.

I have attached a kvm-unit-test I wrote (patch against master of https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git/)
which is able to reproduce the issue on a stock 5.15.0 kernel (*no patches applied at all*) after just a few seconds,
if kvm is loaded without halt-polling (that is, halt_poll_ns=0 is used).

Halt polling and/or Sean's patch are not to blame, they just change the timing.
With Sean's patch I don't need to disable halt polling.

I did find a few AVIC inhibition bugs that this test also hits, and to make it work before I fix them,
I added a workaround to avoid them in this test.
I'll send patches to fix those very soon.
Note that in the windows VM there were no AVIC inhibitions, so those bugs are not relevant.

Wei Huang, do you know if this issue is fixed on Zen3, and if it is fixed on some Zen2 machines?
Any workarounds other than 'don't use avic'?

Best regards,
	Maxim Levitsky
Sean Christopherson Dec. 2, 2021, 2 a.m. UTC | #11
On Thu, Dec 02, 2021, Maxim Levitsky wrote:
> On Tue, 2021-11-30 at 00:53 +0200, Maxim Levitsky wrote:
> > On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> > Basically what I see that
> >  
> > 1. vCPU2 disables is_running in avic physical id cache
> > 2. vCPU2 checks that IRR is empty and it is
> > 3. vCPU2 does schedule();
> >  
> > and it keeps on sleeping forever. If I kick it via signal 
> > (like just doing 'info registers' qemu hmp command
> > or just stop/cont on the same hmp interface, the
> > vCPU wakes up and notices that IRR suddenly is not empty,
> > and the VM comes back to life (and then hangs after a while again
> > with the same problem....).
> >  
> > As far as I see in the traces, the bit in IRR came from
> > another VCPU who didn't respect the is_running bit and didn't get 
> > AVIC_INCOMPLETE_IPI VMexit.
> > I can't 100% prove it yet, but everything in the trace shows this.

...

> I am now almost sure that this is errata #1235.
> 
> I had attached a kvm-unit-test I wrote (patch against master of
> https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git/) which is able to
> reproduce the issue on stock 5.15.0 kernel (*no patches applied at all*)
> after just few seconds.  If kvm is loaded without halt-polling (that is
> halt_poll_ns=0 is used).
> 
> Halt polling and/or Sean's patch are not to blame, it just changes timing.
> With Sean's patch I don't need to disable halt polling.

Hmm, that suggests the bug/erratum is due to the CPU consuming stale data from #4
for the IsRunning check in #5, or retiring uops for the IsRunning check before
retiring the vIRR update.  It would be helpful if the erratum actually provided
info on the "highly specific and detailed set of internal timing conditions". :-/

  4. Lookup the vAPIC backing page address in the Physical APIC table using the
     guest physical APIC ID as an index into the table.
  5. For every valid destination:
     - Atomically set the appropriate IRR bit in each of the destinations’ vAPIC
       backing page.
     - Check the IsRunning status of each destination.
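
Steps 4 and 5 as quoted can be modelled in a few lines of C11. This is only a sketch of the documented ordering, not of what the microcode actually does: the types, the `toy_` names, and the reduction of a guest APIC ID lookup to an array index are all assumptions. The suspected erratum would correspond to the IsRunning load observing a stale value, which this model by construction does not do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_ipi_result {
	TOY_IPI_DOORBELL,        /* target running: signal it via the doorbell */
	TOY_IPI_INCOMPLETE_EXIT, /* target not running: exit so KVM can wake it */
};

/* Hypothetical model of one Physical APIC ID table entry. */
struct toy_phys_entry {
	atomic_bool  is_running;
	atomic_uint *virr;       /* 8 x 32-bit IRR words in the backing page */
};

static enum toy_ipi_result toy_deliver_ipi(struct toy_phys_entry *table,
					   unsigned int nr_entries,
					   unsigned int dest, uint8_t vector)
{
	assert(dest < nr_entries);

	/* Step 4: look up the destination's entry by guest physical APIC ID. */
	struct toy_phys_entry *e = &table[dest];

	/* Step 5, first bullet: atomically set the IRR bit for the vector. */
	atomic_fetch_or(&e->virr[vector / 32], 1u << (vector % 32));

	/* Step 5, second bullet: only then check IsRunning. */
	if (atomic_load(&e->is_running))
		return TOY_IPI_DOORBELL;
	return TOY_IPI_INCOMPLETE_EXIT;
}
```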
Maxim Levitsky Dec. 2, 2021, 10:20 a.m. UTC | #12
On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > Another thing I discovered that this patch series totally breaks my VMs,
> > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > fedora32 VM be very laggy, almost unusable, and it only has one
> > passed-through device, a nic).
> 
> Grrrr, the complete lack of comments in the KVM code and the separate paths for
> VMX vs SVM when handling HLT with APICv make this all way more difficult to
> understand than it should be.
> 
> The hangs are likely due to:
> 
>   KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
> 
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> 
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.
> 
> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> would have to rewalk the vIRR), no easy way to signal the current CPU to do a wakeup (I
> don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> no simplification of load/put code.

I have an idea.
 
Why do we even use/need the GA log?
Why not just disable the 'guest mode' in the iommu and let it send a good old normal interrupt
when a vCPU is not running, just like we do when we inhibit the AVIC?
 
GA log makes all devices that share an iommu (there are 4 iommus per package these days,
some without useful devices) go through a single (!) msi like interrupt,
which is even for some reason implemented by a threaded IRQ in the linux kernel.

 
Best regards,
	Maxim Levitsky

> 
> If the scheduler were changed to support waking in the sched_out path, then I'd be
> more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
> time, but for now it's not worth it.
> 
> > If I apply though only the patch series up to this patch, my fedora VM seems
> > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > in it, which doesn't happen without this patch.
> 
> > By "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> The only search results I can find for LatencyTop are Linux specific.
> 
> > So far the symptoms I see are that on VCPU 0, ISR has quite a high interrupt
> > (0xe1 last time I saw it), TPR and PPR are 0xe0 (although I have seen TPR to
> > have different values), and IRR has plenty of interrupts with lower priority.
> > The VM seems to be stuck in this case. As if its EOI got lost or something is
> > preventing the IRQ handler from issuing EOI.
> >  
> > LatencyTop does install some form of a kernel driver which likely does meddle
> > with interrupts (maybe it sends lots of self IPIs?).
> >  
> > 100% reproducible as soon as I start monitoring with LatencyTop.
> >  
> > Without this patch it works (or if disabling halt polling),
> 
> Huh.  I assume everything works if you disable halt polling _without_ this patch
> applied?
> 
> If so, that implies that successful halt polling without mucking with vCPU IOMMU
> affinity is somehow problematic.  I can't think of any relevant side effects other
> than timing.
>
Paolo Bonzini Dec. 2, 2021, 10:31 a.m. UTC | #13
On 12/2/21 03:00, Sean Christopherson wrote:
> Hmm, that suggests the bug/erratum is due to the CPU consuming stale data from #4
> for the IsRunning check in #5, or retiring uops for the IsRunning check before
> retiring the vIRR update.

Yes, this seems to be an error in the implementation of step 5.  In 
assembly, atomic operations have implicit memory barriers, but who knows 
what's going on in microcode.  So either it's the former, or something 
is going on that's specific to the microcode sequencer, or it's a more 
mundane implementation bug.
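
For reference, the ordering that step 5 is expected to provide can be sketched in userspace C11 as below. This is only a toy model under stated assumptions: `toy_vcpu`, `toy_post_interrupt()` and `toy_prepare_to_block()` are hypothetical names, not KVM or microcode interfaces, and sequentially consistent atomics stand in for whatever the hardware is supposed to guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for one chunk of the vIRR and the IsRunning bit. */
struct toy_vcpu {
	_Atomic uint32_t irr;
	atomic_bool is_running;
};

/*
 * Sender side (models step 5): atomically set the IRR bit first, then
 * check IsRunning.  Returns true if the sender must trigger a wakeup.
 */
static bool toy_post_interrupt(struct toy_vcpu *v, unsigned int vector)
{
	atomic_fetch_or(&v->irr, UINT32_C(1) << (vector % 32)); /* seq_cst RMW */
	return !atomic_load(&v->is_running);
}

/*
 * Receiver side: clear IsRunning first, then rescan the IRR.
 * Returns true if something is pending and the vCPU must not block.
 */
static bool toy_prepare_to_block(struct toy_vcpu *v)
{
	atomic_store(&v->is_running, false);
	return atomic_load(&v->irr) != 0;
}
```

With sequentially consistent operations on both sides, at least one party observes the other's write: either the sender sees IsRunning clear and wakes the target, or the receiver sees the IRR bit and does not block. If the IsRunning check were allowed to use stale data or complete before the IRR update became visible, both sides could miss each other, which is the lost-wakeup symptom discussed in this thread.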

In any case, AVIC is disabled for now and will need a list of models 
where it works, so I'll go on and queue the first part of this series.

Paolo

> It would be helpful if the erratum actually provided
> info on the "highly specific and detailed set of internal timing conditions". :-/
> 
>    4. Lookup the vAPIC backing page address in the Physical APIC table using the
>       guest physical APIC ID as an index into the table.
>    5. For every valid destination:
>       - Atomically set the appropriate IRR bit in each of the destinations’ vAPIC
>         backing page.
>       - Check the IsRunning status of each destination.
Maxim Levitsky Dec. 2, 2021, 10:47 a.m. UTC | #14
On Thu, 2021-12-02 at 12:20 +0200, Maxim Levitsky wrote:
> On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> > On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > > Another thing I discovered that this patch series totally breaks my VMs,
> > > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > > fedora32 VM be very laggy, almost unusable, and it only has one
> > > passed-through device, a nic).
> > 
> > Grrrr, the complete lack of comments in the KVM code and the separate paths for
> > > VMX vs SVM when handling HLT with APICv make this all way more difficult to
> > understand than it should be.
> > 
> > The hangs are likely due to:
> > 
> >   KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
> > 
> > If a posted interrupt arrives after KVM has done its final search through the vIRR,
> > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> > 
> > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> > notification after switching to the wakeup vector.
> > 
> > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> > Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> > > would have to rewalk the vIRR), no easy way to signal the current CPU to do wakeup (I
> > don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> > no simplification of load/put code.
> 
> I have an idea.
>  
> Why do we even use/need the GA log?
> Why not just disable the 'guest mode' in the IOMMU and let it send good old normal interrupts
> when a vCPU is not running, just like we do when we inhibit the AVIC?
>  
> The GA log makes all devices that share an IOMMU (there are 4 IOMMUs per package these days,
> some without useful devices) go through a single (!) MSI-like interrupt,
> which for some reason is even implemented by a threaded IRQ in the Linux kernel.


Yep, this gross hack works!


diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 958966276d00b8..6136b94f6b5f5e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -987,8 +987,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
                entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
        WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
-       avic_update_iommu_vcpu_affinity(vcpu, h_physical_id,
-                                       svm->avic_is_running);
+
+       svm_set_pi_irte_mode(vcpu, svm->avic_is_running);
+       avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -997,8 +998,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
        struct vcpu_svm *svm = to_svm(vcpu);
 
        entry = READ_ONCE(*(svm->avic_physical_id_cache));
-       if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-               avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+       if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) {
+               svm_set_pi_irte_mode(vcpu, false);
+       }
 
        entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
        WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> 


GA log interrupts are almost gone (there are still a few because svm_set_pi_irte_mode sets
is_running to false), and devices work as expected, sending normal interrupts unless the
guest is loaded, at which point the normal interrupts disappear, as expected.
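
As a rough illustration of what the hack above changes, here is a toy userspace model of the three delivery paths. All names (`toy_vcpu`, `toy_vcpu_load()`, `toy_vcpu_put()`, `toy_device_irq()`) are hypothetical, and the model assumes the usual reading of the IRTE: legacy mode delivers an ordinary host interrupt, guest mode with is_running set posts straight into the guest, and guest mode with is_running clear goes through the GA log.

```c
#include <stdbool.h>

enum toy_irte_mode { TOY_IRTE_LEGACY, TOY_IRTE_GUEST };
enum toy_delivery { TOY_HOST_IRQ, TOY_POSTED_TO_GUEST, TOY_GA_LOG };

/* Hypothetical per-vCPU view of the interrupt remapping entry. */
struct toy_vcpu {
	enum toy_irte_mode irte_mode;
	bool irte_is_running;
};

/* Models the patched avic_vcpu_load(): pick the mode, then mark running. */
static void toy_vcpu_load(struct toy_vcpu *v, bool avic_is_running)
{
	v->irte_mode = avic_is_running ? TOY_IRTE_GUEST : TOY_IRTE_LEGACY;
	v->irte_is_running = true;
}

/* Models the patched avic_vcpu_put(): fall back to legacy delivery. */
static void toy_vcpu_put(struct toy_vcpu *v)
{
	v->irte_mode = TOY_IRTE_LEGACY;
}

/* Which path a device interrupt would take in this model. */
static enum toy_delivery toy_device_irq(const struct toy_vcpu *v)
{
	if (v->irte_mode == TOY_IRTE_LEGACY)
		return TOY_HOST_IRQ;
	return v->irte_is_running ? TOY_POSTED_TO_GUEST : TOY_GA_LOG;
}
```

In this model the GA log path is only reachable in the window where guest mode is enabled but is_running has not yet been set, which matches the observation that a few GA log interrupts remain.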

Best regards,
	Maxim Levitsky

>  
> Best regards,
> 	Maxim Levitsky
> 
> > If the scheduler were changed to support waking in the sched_out path, then I'd be
> > more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
> > time, but for now it's not worth it.
> > 
> > > If I apply though only the patch series up to this patch, my fedora VM seems
> > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > > in it, which doesn't happen without this patch.
> > 
> > > By "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> > The only search results I can find for LatencyTop are Linux specific.
> > 
> > > > So far the symptoms I see are that on VCPU 0, ISR has quite a high interrupt
> > > > (0xe1 last time I saw it), TPR and PPR are 0xe0 (although I have seen TPR to
> > > have different values), and IRR has plenty of interrupts with lower priority.
> > > The VM seems to be stuck in this case. As if its EOI got lost or something is
> > > preventing the IRQ handler from issuing EOI.
> > >  
> > > LatencyTop does install some form of a kernel driver which likely does meddle
> > > with interrupts (maybe it sends lots of self IPIs?).
> > >  
> > > 100% reproducible as soon as I start monitoring with LatencyTop.
> > >  
> > > Without this patch it works (or if disabling halt polling),
> > 
> > Huh.  I assume everything works if you disable halt polling _without_ this patch
> > applied?
> > 
> > If so, that implies that successful halt polling without mucking with vCPU IOMMU
> > affinity is somehow problematic.  I can't think of any relevant side effects other
> > than timing.
> >
Maxim Levitsky Dec. 2, 2021, 12:02 p.m. UTC | #15
On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > Another thing I discovered that this patch series totally breaks my VMs,
> > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > fedora32 VM be very laggy, almost unusable, and it only has one
> > passed-through device, a nic).
> 
> Grrrr, the complete lack of comments in the KVM code and the separate paths for
> > VMX vs SVM when handling HLT with APICv make this all way more difficult to
> understand than it should be.
> 
> The hangs are likely due to:
> 
>   KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)

Yes, the other hang I mentioned, which makes all my VMs very laggy and almost impossible
to use, is caused by the above patch, but since I have now reproduced it again without
any passed-through device, I also blame the CPU erratum for this.

Best regards,
	Maxim Levitsky


> 
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> 
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.
> 
> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> > would have to rewalk the vIRR), no easy way to signal the current CPU to do wakeup (I
> don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> no simplification of load/put code.
> 
> If the scheduler were changed to support waking in the sched_out path, then I'd be
> more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final
> time, but for now it's not worth it.
> 
> > If I apply though only the patch series up to this patch, my fedora VM seems
> > to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> > in it, which doesn't happen without this patch.
> 
> > By "run 'LatencyTop' in it", do you mean running something in the Windows guest?
> The only search results I can find for LatencyTop are Linux specific.
> 
> > > So far the symptoms I see are that on VCPU 0, ISR has quite a high interrupt
> > > (0xe1 last time I saw it), TPR and PPR are 0xe0 (although I have seen TPR to
> > have different values), and IRR has plenty of interrupts with lower priority.
> > The VM seems to be stuck in this case. As if its EOI got lost or something is
> > preventing the IRQ handler from issuing EOI.
> >  
> > LatencyTop does install some form of a kernel driver which likely does meddle
> > with interrupts (maybe it sends lots of self IPIs?).
> >  
> > 100% reproducible as soon as I start monitoring with LatencyTop.
> >  
> > Without this patch it works (or if disabling halt polling),
> 
> Huh.  I assume everything works if you disable halt polling _without_ this patch
> applied?
> 
> If so, that implies that successful halt polling without mucking with vCPU IOMMU
> affinity is somehow problematic.  I can't think of any relevant side effects other
> than timing.
>

Patch
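
As a rough illustration of the reordering in the diff that follows, the control flow can be modelled in userspace as below. This is a sketch only: `toy_stats` and `toy_vcpu_block()` are hypothetical, the counters stand in for kvm_arch_vcpu_blocking(), the rcuwait/schedule() loop and kvm_arch_vcpu_unblocking(), and `event_at` is an invented parameter that says at which polling iteration an event becomes pending.

```c
#include <stdbool.h>

/* Counts how often each stand-in for the real hooks would have run. */
struct toy_stats {
	int blocking_calls;
	int schedules;
	int unblocking_calls;
};

/*
 * Returns true if halt-polling succeeded.  A negative event_at, or one
 * beyond poll_budget, means nothing arrives during the polling window.
 */
static bool toy_vcpu_block(struct toy_stats *s, int poll_budget, int event_at)
{
	for (int i = 0; i < poll_budget; i++) {
		if (i == event_at)
			return true;	/* polled successfully: no arch hooks */
	}

	s->blocking_calls++;	/* block hook runs only once polling failed */
	s->schedules++;		/* the vCPU actually goes to sleep */
	s->unblocking_calls++;	/* unblock hook pairs with the block hook */
	return false;
}
```

After the patch the block and unblock hooks are always paired and are skipped entirely when polling finds an event, which is the behaviour the model counts.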

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f90b3ed05628..227f6bbe0716 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3235,8 +3235,6 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	bool waited = false;
 	u64 block_ns;
 
-	kvm_arch_vcpu_blocking(vcpu);
-
 	start = cur = poll_end = ktime_get();
 	if (do_halt_poll) {
 		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
@@ -3253,6 +3251,7 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		} while (kvm_vcpu_can_poll(cur, stop));
 	}
 
+	kvm_arch_vcpu_blocking(vcpu);
 
 	prepare_to_rcuwait(wait);
 	for (;;) {
@@ -3265,6 +3264,9 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		schedule();
 	}
 	finish_rcuwait(wait);
+
+	kvm_arch_vcpu_unblocking(vcpu);
+
 	cur = ktime_get();
 	if (waited) {
 		vcpu->stat.generic.halt_wait_ns +=
@@ -3273,7 +3275,6 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 				ktime_to_ns(cur) - ktime_to_ns(poll_end));
 	}
 out:
-	kvm_arch_vcpu_unblocking(vcpu);
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
 
 	/*