Message ID | 20190401133659.20421-1-vkuznets@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ioapic: allow buggy guests mishandling level-triggered interrupts to make progress | expand |
On 01/04/19 15:36, Vitaly Kuznetsov wrote: > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + > piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that > it is mishandling level-triggered interrupt performing EOI without fixing > the root cause. This causes immediate re-assertion and L2 VM (which is > supposedly expected to fix the cause of the interrupt) is not making any > progress. > > Gory details: https://www.spinics.net/lists/kvm/msg184484.html > > Turns out we were dealing with similar issues before; in-kernel IOAPIC > implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay > irq delivery duringeoi broadcast") which describes a very similar issue. > > Steal the idea from the above mentioned commit for IOAPIC implementation in > QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> > --- > hw/intc/ioapic.c | 43 ++++++++++++++++++++++++++++++- > hw/intc/trace-events | 1 + > include/hw/i386/ioapic_internal.h | 3 +++ > 3 files changed, 46 insertions(+), 1 deletion(-) > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c > index 9d75f84d3b..daf45cc8a8 100644 > --- a/hw/intc/ioapic.c > +++ b/hw/intc/ioapic.c > @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) > } > } > > +#define SUCCESSIVE_IRQ_MAX_COUNT 10000 > + > +static void ioapic_timer(void *opaque) > +{ > + IOAPICCommonState *s = opaque; > + > + ioapic_service(s); > +} > + > static void ioapic_set_irq(void *opaque, int vector, int level) > { > IOAPICCommonState *s = opaque; > @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) > trace_ioapic_clear_remote_irr(n, vector); > s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; > if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { > - ioapic_service(s); > + bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) > + == IOAPIC_TRIGGER_LEVEL; > + > + ++s->irq_reassert[vector]; > + if (!level || > + s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { > + ioapic_service(s); > + } else { > + /* > + * Real hardware does not deliver the interrupt > + * immediately during eoi broadcast, and this lets a > + * buggy guest make slow progress even if it does not > + * correctly handle a level-triggered interrupt. Emulate > + * this behavior if we detect an interrupt storm. > + */ > + trace_ioapic_eoi_delayed_reassert(vector); > + timer_mod(s->timer, > + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + > + NANOSECONDS_PER_SECOND / 100); Should this be done only if the timer isn't pending? Paolo
> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: > > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + > piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that > it is mishandling level-triggered interrupt performing EOI without fixing > the root cause. I would rephrase as: It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. This results in IOAPIC keep re-raising the level-triggered interrupt after EOI because irq-line remains asserted. > This causes immediate re-assertion and L2 VM (which is > supposedly expected to fix the cause of the interrupt) is not making any > progress. I don’t know why you assume this. From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest This means that the handler for this level-triggered interrupt is on Hyper-V and not it’s guest. > > Gory details: https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html&d=DwIDAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU&s=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA&e= Maybe worth to note that one should read the entire thread to understand the analysis. > > Turns out we were dealing with similar issues before; in-kernel IOAPIC > implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay > irq delivery duringeoi broadcast") which describes a very similar issue. > > Steal the idea from the above mentioned commit for IOAPIC implementation in > QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> > --- > hw/intc/ioapic.c | 43 ++++++++++++++++++++++++++++++- > hw/intc/trace-events | 1 + > include/hw/i386/ioapic_internal.h | 3 +++ > 3 files changed, 46 insertions(+), 1 deletion(-) > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c > index 9d75f84d3b..daf45cc8a8 100644 > --- a/hw/intc/ioapic.c > +++ b/hw/intc/ioapic.c > @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) > } > } > > +#define SUCCESSIVE_IRQ_MAX_COUNT 10000 > + > +static void ioapic_timer(void *opaque) > +{ > + IOAPICCommonState *s = opaque; > + > + ioapic_service(s); > +} > + > static void ioapic_set_irq(void *opaque, int vector, int level) > { > IOAPICCommonState *s = opaque; > @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) > trace_ioapic_clear_remote_irr(n, vector); > s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; This clear of remote-irr should happen only for level-triggered interrupts. So we can make the code here more structured like KVM’s __kvm_ioapic_update_eoi(). It also have an extra-value of not advancing "s->irq_reassert[vector]” for vectors which are edge-triggered which is a bit misleading. > if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { > - ioapic_service(s); > + bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) > + == IOAPIC_TRIGGER_LEVEL; Nit: I would declare variable “trig_mode” and compare it later explicitly to IOAPIC_TRIGGER_LEVEL. > + > + ++s->irq_reassert[vector]; Nit: I would rename “irq_reassert” to “irq_eoi” as it is named in KVM IOAPIC code. > + if (!level || > + s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { > + ioapic_service(s); > + } else { > + /* > + * Real hardware does not deliver the interrupt > + * immediately during eoi broadcast, and this lets a > + * buggy guest make slow progress even if it does not > + * correctly handle a level-triggered interrupt. Emulate > + * this behavior if we detect an interrupt storm. > + */ > + trace_ioapic_eoi_delayed_reassert(vector); > + timer_mod(s->timer, > + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + > + NANOSECONDS_PER_SECOND / 100); You need to zero here s->irq_reassert[vector]. I think you would avoid these kind of little mistakes if you will attempt to follow KVM’s commit code structure. > + } > + } else { > + s->irq_reassert[vector] = 0; > } > } > } > @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp) > memory_region_init_io(&s->io_memory, OBJECT(s), &ioapic_io_ops, s, > "ioapic", 0x1000); > > + s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, ioapic_timer, s); > + > qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS); > > ioapics[ioapic_no] = s; > @@ -408,6 +440,14 @@ static void ioapic_realize(DeviceState *dev, Error **errp) > qemu_add_machine_init_done_notifier(&s->machine_done); > } > > +static void ioapic_unrealize(DeviceState *dev, Error **errp) > +{ > + IOAPICCommonState *s = IOAPIC_COMMON(dev); > + > + timer_del(s->timer); > + timer_free(s->timer); > +} > + > static Property ioapic_properties[] = { > DEFINE_PROP_UINT8("version", IOAPICCommonState, version, IOAPIC_VER_DEF), > DEFINE_PROP_END_OF_LIST(), > @@ -419,6 +459,7 @@ static void ioapic_class_init(ObjectClass *klass, void *data) > DeviceClass *dc = DEVICE_CLASS(klass); > > k->realize = ioapic_realize; > + k->unrealize = ioapic_unrealize; > /* > * If APIC is in kernel, we need to update the kernel cache after > * migration, otherwise first 24 gsi routes will be invalid. > diff --git a/hw/intc/trace-events b/hw/intc/trace-events > index a28bdce925..90c9d07c1a 100644 > --- a/hw/intc/trace-events > +++ b/hw/intc/trace-events > @@ -25,6 +25,7 @@ apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 0x%08x" > ioapic_set_remote_irr(int n) "set remote irr for pin %d" > ioapic_clear_remote_irr(int n, int vector) "clear remote irr for pin %d vector %d" > ioapic_eoi_broadcast(int vector) "EOI broadcast for vector %d" > +ioapic_eoi_delayed_reassert(int vector) "delayed reassert on EOI broadcast for vector %d" > ioapic_mem_read(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem read addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" retval 0x%"PRIx32 > ioapic_mem_write(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem write addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" val 0x%"PRIx32 > ioapic_set_irq(int vector, int level) "vector: %d level: %d" > diff --git a/include/hw/i386/ioapic_internal.h b/include/hw/i386/ioapic_internal.h > index 9848f391bb..e0ee88db40 100644 > --- a/include/hw/i386/ioapic_internal.h > +++ b/include/hw/i386/ioapic_internal.h > @@ -96,6 +96,7 @@ typedef struct IOAPICCommonClass { > SysBusDeviceClass parent_class; > > DeviceRealize realize; > + DeviceUnrealize unrealize; > void (*pre_save)(IOAPICCommonState *s); > void (*post_load)(IOAPICCommonState *s); > } IOAPICCommonClass; > @@ -111,6 +112,8 @@ struct IOAPICCommonState { > uint8_t version; > uint64_t irq_count[IOAPIC_NUM_PINS]; > int irq_level[IOAPIC_NUM_PINS]; > + int irq_reassert[IOAPIC_NUM_PINS]; > + QEMUTimer *timer; > }; > > void ioapic_reset_common(DeviceState *dev); > -- > 2.20.1 >
Paolo Bonzini <pbonzini@redhat.com> writes: > On 01/04/19 15:36, Vitaly Kuznetsov wrote: ... >> static void ioapic_set_irq(void *opaque, int vector, int level) >> { >> IOAPICCommonState *s = opaque; >> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) >> trace_ioapic_clear_remote_irr(n, vector); >> s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; >> if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { >> - ioapic_service(s); >> + bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) >> + == IOAPIC_TRIGGER_LEVEL; >> + >> + ++s->irq_reassert[vector]; >> + if (!level || >> + s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { >> + ioapic_service(s); >> + } else { >> + /* >> + * Real hardware does not deliver the interrupt >> + * immediately during eoi broadcast, and this lets a >> + * buggy guest make slow progress even if it does not >> + * correctly handle a level-triggered interrupt. Emulate >> + * this behavior if we detect an interrupt storm. >> + */ >> + trace_ioapic_eoi_delayed_reassert(vector); >> + timer_mod(s->timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >> + NANOSECONDS_PER_SECOND / 100); > > Should this be done only if the timer isn't pending? Hm, maybe ... but how can this happen? To get here we need remote IRR bit and we clear it so someone needs to re-set it. The source won't probably be doing this (it is a level-triggered interrupt and it is already pending - why re-asserting?) but even if it does ioapic_service(s) will be called and when our timer fires we will just do nothing (consequitive ioapic_service() doesn't hurt).
On 01/04/19 17:13, Vitaly Kuznetsov wrote: >>> + trace_ioapic_eoi_delayed_reassert(vector); >>> + timer_mod(s->timer, >>> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >>> + NANOSECONDS_PER_SECOND / 100); >> Should this be done only if the timer isn't pending? > Hm, maybe ... but how can this happen? To get here we need remote IRR > bit and we clear it so someone needs to re-set it. The source won't > probably be doing this (it is a level-triggered interrupt and it is > already pending - why re-asserting?) but even if it does > ioapic_service(s) will be called and when our timer fires we will just > do nothing (consequitive ioapic_service() doesn't hurt). Yeah, it's just for cleanliness. Let's change it to timer_mod_anticipate and call it a day. Paolo
Liran Alon <liran.alon@oracle.com> writes: >> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: >> >> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that >> it is mishandling level-triggered interrupt performing EOI without fixing >> the root cause. > > I would rephrase as: > It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. > Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. > This results in IOAPIC keep re-raising the level-triggered interrupt > after EOI because irq-line remains asserted. Ok, thanks for the suggestion. > >> This causes immediate re-assertion and L2 VM (which is >> supposedly expected to fix the cause of the interrupt) is not making any >> progress. > > I don’t know why you assume this. > From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest > This means that the handler for this level-triggered interrupt is on > Hyper-V and not it’s guest. If you let it run (with e.g. this patch or by setting preemtion timer > 0) you'll see that MMIO write fixing the cause of the interrupt is happening from L2: (qemu) info pci: Bus 0, device 4, function 0: USB controller: PCI device 8086:7112 PCI subsystem 1af4:1100 IRQ 23. BAR4: I/O at 0x6060 [0x607f]. id "" ... 538597.212494: kvm_exit: reason VMRESUME rip 0xfffff80004250115 info 0 0 538597.212499: kvm_entry: vcpu 0 538597.212506: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a27 info 60620009 0 538597.212507: kvm_nested_vmexit: rip fffff80e02ac6a27 reason IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 538597.212509: kvm_fpu: unload 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) 538597.212516: kvm_fpu: load 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val 0x1 538597.212519: kvm_entry: vcpu 0 538597.212523: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a61 info 60640009 0 538597.212523: kvm_nested_vmexit: rip fffff80e02ac6a61 reason IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 538597.212524: kvm_fpu: unload 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) 538597.212528: kvm_fpu: load 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val 0xf ... and this happens after EOI from L1. > >> >> Gory details: https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html&d=DwIDAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU&s=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA&e= > > Maybe worth to note that one should read the entire thread to understand the analysis. > Sure. >> >> Turns out we were dealing with similar issues before; in-kernel IOAPIC >> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay >> irq delivery duringeoi broadcast") which describes a very similar issue. >> >> Steal the idea from the above mentioned commit for IOAPIC implementation in >> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. >> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> >> --- >> hw/intc/ioapic.c | 43 ++++++++++++++++++++++++++++++- >> hw/intc/trace-events | 1 + >> include/hw/i386/ioapic_internal.h | 3 +++ >> 3 files changed, 46 insertions(+), 1 deletion(-) >> >> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c >> index 9d75f84d3b..daf45cc8a8 100644 >> --- a/hw/intc/ioapic.c >> +++ b/hw/intc/ioapic.c >> @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) >> } >> } >> >> +#define SUCCESSIVE_IRQ_MAX_COUNT 10000 >> + >> +static void ioapic_timer(void *opaque) >> +{ >> + IOAPICCommonState *s = opaque; >> + >> + ioapic_service(s); >> +} >> + >> static void ioapic_set_irq(void *opaque, int vector, int level) >> { >> IOAPICCommonState *s = opaque; >> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) >> trace_ioapic_clear_remote_irr(n, vector); >> s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; > > This clear of remote-irr should happen only for level-triggered interrupts. > So we can make the code here more structured like KVM’s __kvm_ioapic_update_eoi(). > It also have an extra-value of not advancing "s->irq_reassert[vector]” for vectors which are edge-triggered which is a bit misleading. > >> if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { >> - ioapic_service(s); >> + bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) >> + == IOAPIC_TRIGGER_LEVEL; > > Nit: I would declare variable “trig_mode” and compare it later > explicitly to IOAPIC_TRIGGER_LEVEL. Ok. > >> + >> + ++s->irq_reassert[vector]; > > Nit: I would rename “irq_reassert” to “irq_eoi” as it is named in KVM > IOAPIC code. Honestly, I did this change on purpose: we're counting how many times the irq was re-asserted on eoi and 'irq_eoi' sounds more like an EOI flag to me. > >> + if (!level || >> + s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { >> + ioapic_service(s); >> + } else { >> + /* >> + * Real hardware does not deliver the interrupt >> + * immediately during eoi broadcast, and this lets a >> + * buggy guest make slow progress even if it does not >> + * correctly handle a level-triggered interrupt. Emulate >> + * this behavior if we detect an interrupt storm. >> + */ >> + trace_ioapic_eoi_delayed_reassert(vector); >> + timer_mod(s->timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >> + NANOSECONDS_PER_SECOND / 100); > > You need to zero here s->irq_reassert[vector]. > I think you would avoid these kind of little mistakes if you will > attempt to follow KVM’s commit code structure. Oh, I borrowed the idea for the commit and you'd like me to borrow the commit itself :-) We can, of couse, zero s->irq_reassert[vector] here, however, that would change the behavior: with my patch we will be delaying all re-assertions after SUCCESSIVE_IRQ_MAX_COUNT until the line is lowered, not just one. I'm not sure which is better. Anyway, I checked and the issue is gone even if we delay once and you're right that there's no reason to differ from kernel commit. Will fix in v2, thanks! > >> + } >> + } else { >> + s->irq_reassert[vector] = 0; >> } >> } >> } >> @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp) >> memory_region_init_io(&s->io_memory, OBJECT(s), &ioapic_io_ops, s, >> "ioapic", 0x1000); >> >> + s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, ioapic_timer, s); >> + >> qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS); >> >> ioapics[ioapic_no] = s; >> @@ -408,6 +440,14 @@ static void ioapic_realize(DeviceState *dev, Error **errp) >> qemu_add_machine_init_done_notifier(&s->machine_done); >> } >> >> +static void ioapic_unrealize(DeviceState *dev, Error **errp) >> +{ >> + IOAPICCommonState *s = IOAPIC_COMMON(dev); >> + >> + timer_del(s->timer); >> + timer_free(s->timer); >> +} >> + >> static Property ioapic_properties[] = { >> DEFINE_PROP_UINT8("version", IOAPICCommonState, version, IOAPIC_VER_DEF), >> DEFINE_PROP_END_OF_LIST(), >> @@ -419,6 +459,7 @@ static void ioapic_class_init(ObjectClass *klass, void *data) >> DeviceClass *dc = DEVICE_CLASS(klass); >> >> k->realize = ioapic_realize; >> + k->unrealize = ioapic_unrealize; >> /* >> * If APIC is in kernel, we need to update the kernel cache after >> * migration, otherwise first 24 gsi routes will be invalid. >> diff --git a/hw/intc/trace-events b/hw/intc/trace-events >> index a28bdce925..90c9d07c1a 100644 >> --- a/hw/intc/trace-events >> +++ b/hw/intc/trace-events >> @@ -25,6 +25,7 @@ apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 0x%08x" >> ioapic_set_remote_irr(int n) "set remote irr for pin %d" >> ioapic_clear_remote_irr(int n, int vector) "clear remote irr for pin %d vector %d" >> ioapic_eoi_broadcast(int vector) "EOI broadcast for vector %d" >> +ioapic_eoi_delayed_reassert(int vector) "delayed reassert on EOI broadcast for vector %d" >> ioapic_mem_read(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem read addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" retval 0x%"PRIx32 >> ioapic_mem_write(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem write addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" val 0x%"PRIx32 >> ioapic_set_irq(int vector, int level) "vector: %d level: %d" >> diff --git a/include/hw/i386/ioapic_internal.h b/include/hw/i386/ioapic_internal.h >> index 9848f391bb..e0ee88db40 100644 >> --- a/include/hw/i386/ioapic_internal.h >> +++ b/include/hw/i386/ioapic_internal.h >> @@ -96,6 +96,7 @@ typedef struct IOAPICCommonClass { >> SysBusDeviceClass parent_class; >> >> DeviceRealize realize; >> + DeviceUnrealize unrealize; >> void (*pre_save)(IOAPICCommonState *s); >> void (*post_load)(IOAPICCommonState *s); >> } IOAPICCommonClass; >> @@ -111,6 +112,8 @@ struct IOAPICCommonState { >> uint8_t version; >> uint64_t irq_count[IOAPIC_NUM_PINS]; >> int irq_level[IOAPIC_NUM_PINS]; >> + int irq_reassert[IOAPIC_NUM_PINS]; >> + QEMUTimer *timer; >> }; >> >> void ioapic_reset_common(DeviceState *dev); >> -- >> 2.20.1 >> >
> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: > > Liran Alon <liran.alon@oracle.com> writes: > >>> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: >>> >>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >>> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that >>> it is mishandling level-triggered interrupt performing EOI without fixing >>> the root cause. >> >> I would rephrase as: >> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. >> Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. >> This results in IOAPIC keep re-raising the level-triggered interrupt >> after EOI because irq-line remains asserted. > > Ok, thanks for the suggestion. > >> >>> This causes immediate re-assertion and L2 VM (which is >>> supposedly expected to fix the cause of the interrupt) is not making any >>> progress. >> >> I don’t know why you assume this. >> From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest >> This means that the handler for this level-triggered interrupt is on >> Hyper-V and not it’s guest. > > If you let it run (with e.g. this patch or by setting preemtion timer > > 0) you'll see that MMIO write fixing the cause of the interrupt is > happening from L2: > > (qemu) info pci: > > Bus 0, device 4, function 0: > USB controller: PCI device 8086:7112 > PCI subsystem 1af4:1100 > IRQ 23. > BAR4: I/O at 0x6060 [0x607f]. > id "" > > ... > 538597.212494: kvm_exit: reason VMRESUME rip 0xfffff80004250115 info 0 0 > 538597.212499: kvm_entry: vcpu 0 > 538597.212506: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a27 info 60620009 0 > 538597.212507: kvm_nested_vmexit: rip fffff80e02ac6a27 reason IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 > 538597.212509: kvm_fpu: unload > 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) > 538597.212516: kvm_fpu: load > 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val 0x1 > 538597.212519: kvm_entry: vcpu 0 > 538597.212523: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a61 info 60640009 0 > 538597.212523: kvm_nested_vmexit: rip fffff80e02ac6a61 reason IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 > 538597.212524: kvm_fpu: unload > 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) > 538597.212528: kvm_fpu: load > 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val 0xf > ... > > and this happens after EOI from L1. I see that the L2 guest is doing I/O read to the device BAR4 but do these reads lower the irq-line? I would expect a write to lower the irq-line. Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a value. Doesn’t lower irq-line. (Even though offset 0x04 returns the "interrupt enable register”). In contrast, looking at uhci_port_write(), it seems that writing to either offset 0x02 or 0x04 could lower the irq-line. So you should look for pio_write to port 0x6062 or 0x6064 to see who is actually responsible for lowering the irq-line. Having said that, it does make sense that the layer (L1/L2) who interacts in any way with the device’s BAR is probably also the one who lowers the irq-line. I just fine it hard to believe that on a real physical hardware with this device, that there will be no issue. i.e. That hardware IOAPIC will delay the re-injection of the level-triggered interrupt (since the IOAPIC EOI) by so much that Hyper-V can make it to resume into it’s guest before it is injected. Maybe Hyper-V was never tested with such a real device? ;) > >> >>> >>> Gory details: https://urldefense.proofpoint.com/v2/url?u=https-3A__www.spinics.net_lists_kvm_msg184484.html&d=DwIDAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Q0Ico0Nb_DGRDrDgXkjkRr-xjzIbOLteVOhDJXBD_pU&s=d_H4_-qzqGvyi8X7g_KA0hZ5a8zjfHQhe1BhLPIokcA&e= >> >> Maybe worth to note that one should read the entire thread to understand the analysis. >> > > Sure. > >>> >>> Turns out we were dealing with similar issues before; in-kernel IOAPIC >>> implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay >>> irq delivery duringeoi broadcast") which describes a very similar issue. >>> >>> Steal the idea from the above mentioned commit for IOAPIC implementation in >>> QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. >>> >>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> >>> --- >>> hw/intc/ioapic.c | 43 ++++++++++++++++++++++++++++++- >>> hw/intc/trace-events | 1 + >>> include/hw/i386/ioapic_internal.h | 3 +++ >>> 3 files changed, 46 insertions(+), 1 deletion(-) >>> >>> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c >>> index 9d75f84d3b..daf45cc8a8 100644 >>> --- a/hw/intc/ioapic.c >>> +++ b/hw/intc/ioapic.c >>> @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) >>> } >>> } >>> >>> +#define SUCCESSIVE_IRQ_MAX_COUNT 10000 >>> + >>> +static void ioapic_timer(void *opaque) >>> +{ >>> + IOAPICCommonState *s = opaque; >>> + >>> + ioapic_service(s); >>> +} >>> + >>> static void ioapic_set_irq(void *opaque, int vector, int level) >>> { >>> IOAPICCommonState *s = opaque; >>> @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) >>> trace_ioapic_clear_remote_irr(n, vector); >>> s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; >> >> This clear of remote-irr should happen only for level-triggered interrupts. >> So we can make the code here more structured like KVM’s __kvm_ioapic_update_eoi(). >> It also have an extra-value of not advancing "s->irq_reassert[vector]” for vectors which are edge-triggered which is a bit misleading. >> >>> if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { >>> - ioapic_service(s); >>> + bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) >>> + == IOAPIC_TRIGGER_LEVEL; >> >> Nit: I would declare variable “trig_mode” and compare it later >> explicitly to IOAPIC_TRIGGER_LEVEL. > > Ok. > >> >>> + >>> + ++s->irq_reassert[vector]; >> >> Nit: I would rename “irq_reassert” to “irq_eoi” as it is named in KVM >> IOAPIC code. > > Honestly, I did this change on purpose: we're counting how many times the > irq was re-asserted on eoi and 'irq_eoi' sounds more like an EOI flag to > me. Ok. :) > >> >>> + if (!level || >>> + s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { >>> + ioapic_service(s); >>> + } else { >>> + /* >>> + * Real hardware does not deliver the interrupt >>> + * immediately during eoi broadcast, and this lets a >>> + * buggy guest make slow progress even if it does not >>> + * correctly handle a level-triggered interrupt. Emulate >>> + * this behavior if we detect an interrupt storm. >>> + */ >>> + trace_ioapic_eoi_delayed_reassert(vector); >>> + timer_mod(s->timer, >>> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >>> + NANOSECONDS_PER_SECOND / 100); >> >> You need to zero here s->irq_reassert[vector]. >> I think you would avoid these kind of little mistakes if you will >> attempt to follow KVM’s commit code structure. > > Oh, I borrowed the idea for the commit and you'd like me to borrow the > commit itself :-) Yep :P > > We can, of couse, zero s->irq_reassert[vector] here, however, that would > change the behavior: with my patch we will be delaying all re-assertions > after SUCCESSIVE_IRQ_MAX_COUNT until the line is lowered, not just > one. I'm not sure which is better. > > Anyway, I checked and the issue is gone even if we delay once and you're > right that there's no reason to differ from kernel commit. Will fix in > v2, thanks! Thanks, -Liran > >> >>> + } >>> + } else { >>> + s->irq_reassert[vector] = 0; >>> } >>> } >>> } >>> @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp) >>> memory_region_init_io(&s->io_memory, OBJECT(s), &ioapic_io_ops, s, >>> "ioapic", 0x1000); >>> >>> + s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, ioapic_timer, s); >>> + >>> qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS); >>> >>> ioapics[ioapic_no] = s; >>> @@ -408,6 +440,14 @@ static void ioapic_realize(DeviceState *dev, Error **errp) >>> qemu_add_machine_init_done_notifier(&s->machine_done); >>> } >>> >>> +static void ioapic_unrealize(DeviceState *dev, Error **errp) >>> +{ >>> + IOAPICCommonState *s = IOAPIC_COMMON(dev); >>> + >>> + timer_del(s->timer); >>> + timer_free(s->timer); >>> +} >>> + >>> static Property ioapic_properties[] = { >>> DEFINE_PROP_UINT8("version", IOAPICCommonState, version, IOAPIC_VER_DEF), >>> DEFINE_PROP_END_OF_LIST(), >>> @@ -419,6 +459,7 @@ static void ioapic_class_init(ObjectClass *klass, void *data) >>> DeviceClass *dc = DEVICE_CLASS(klass); >>> >>> k->realize = ioapic_realize; >>> + k->unrealize = ioapic_unrealize; >>> /* >>> * If APIC is in kernel, we need to update the kernel cache after >>> * migration, otherwise first 24 gsi routes will be invalid. >>> diff --git a/hw/intc/trace-events b/hw/intc/trace-events >>> index a28bdce925..90c9d07c1a 100644 >>> --- a/hw/intc/trace-events >>> +++ b/hw/intc/trace-events >>> @@ -25,6 +25,7 @@ apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 0x%08x" >>> ioapic_set_remote_irr(int n) "set remote irr for pin %d" >>> ioapic_clear_remote_irr(int n, int vector) "clear remote irr for pin %d vector %d" >>> ioapic_eoi_broadcast(int vector) "EOI broadcast for vector %d" >>> +ioapic_eoi_delayed_reassert(int vector) "delayed reassert on EOI broadcast for vector %d" >>> ioapic_mem_read(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem read addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" retval 0x%"PRIx32 >>> ioapic_mem_write(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem write addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" val 0x%"PRIx32 >>> ioapic_set_irq(int vector, int level) "vector: %d level: %d" >>> diff --git a/include/hw/i386/ioapic_internal.h b/include/hw/i386/ioapic_internal.h >>> index 9848f391bb..e0ee88db40 100644 >>> --- a/include/hw/i386/ioapic_internal.h >>> +++ b/include/hw/i386/ioapic_internal.h >>> @@ -96,6 +96,7 @@ typedef struct IOAPICCommonClass { >>> SysBusDeviceClass parent_class; >>> >>> DeviceRealize realize; >>> + DeviceUnrealize unrealize; >>> void (*pre_save)(IOAPICCommonState *s); >>> void (*post_load)(IOAPICCommonState *s); >>> } IOAPICCommonClass; >>> @@ -111,6 +112,8 @@ struct IOAPICCommonState { >>> uint8_t version; >>> uint64_t irq_count[IOAPIC_NUM_PINS]; >>> int irq_level[IOAPIC_NUM_PINS]; >>> + int irq_reassert[IOAPIC_NUM_PINS]; >>> + QEMUTimer *timer; >>> }; >>> >>> void ioapic_reset_common(DeviceState *dev); >>> -- >>> 2.20.1 >>> >> > > -- > Vitaly
Liran Alon <liran.alon@oracle.com> writes: >> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: >> >> Liran Alon <liran.alon@oracle.com> writes: >> >>>> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: >>>> >>>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >>>> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that >>>> it is mishandling level-triggered interrupt performing EOI without fixing >>>> the root cause. >>> >>> I would rephrase as: >>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. >>> Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. >>> This results in IOAPIC keep re-raising the level-triggered interrupt >>> after EOI because irq-line remains asserted. >> >> Ok, thanks for the suggestion. >> >>> >>>> This causes immediate re-assertion and L2 VM (which is >>>> supposedly expected to fix the cause of the interrupt) is not making any >>>> progress. >>> >>> I don’t know why you assume this. >>> From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest >>> This means that the handler for this level-triggered interrupt is on >>> Hyper-V and not it’s guest. >> >> If you let it run (with e.g. this patch or by setting preemtion timer > >> 0) you'll see that MMIO write fixing the cause of the interrupt is >> happening from L2: >> >> (qemu) info pci: >> >> Bus 0, device 4, function 0: >> USB controller: PCI device 8086:7112 >> PCI subsystem 1af4:1100 >> IRQ 23. >> BAR4: I/O at 0x6060 [0x607f]. >> id "" >> >> ... >> 538597.212494: kvm_exit: reason VMRESUME rip 0xfffff80004250115 info 0 0 >> 538597.212499: kvm_entry: vcpu 0 >> 538597.212506: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a27 info 60620009 0 >> 538597.212507: kvm_nested_vmexit: rip fffff80e02ac6a27 reason IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 >> 538597.212509: kvm_fpu: unload >> 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) >> 538597.212516: kvm_fpu: load >> 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val 0x1 >> 538597.212519: kvm_entry: vcpu 0 >> 538597.212523: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a61 info 60640009 0 >> 538597.212523: kvm_nested_vmexit: rip fffff80e02ac6a61 reason IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 >> 538597.212524: kvm_fpu: unload >> 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) >> 538597.212528: kvm_fpu: load >> 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val 0xf >> ... >> >> and this happens after EOI from L1. > > I see that the L2 guest is doing I/O read to the device BAR4 but do these reads lower the irq-line? > I would expect a write to lower the irq-line. > > Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a value. Doesn’t lower irq-line. > (Even though offset 0x04 returns the "interrupt enable register”). > In contrast, looking at uhci_port_write(), it seems that writing to either offset 0x02 or 0x04 could lower the irq-line. > So you should look for pio_write to port 0x6062 or 0x6064 to see who > is actually responsible for lowering the irq-line. Sorry, I probably like trimming traces too much. Writes happen too: [005] 538597.212532: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a8f info 60620001 0 [005] 538597.212533: kvm_nested_vmexit: rip fffff80e02ac6a8f reason IO_INSTRUCTION info1 60620001 info2 0 int_info 0 int_info_err 0 [005] 538597.212534: kvm_pio: pio_write at 0x6062 size 2 count 1 val 0x1 [005] 538597.212534: kvm_fpu: unload [005] 538597.212535: kvm_userspace_exit: reason KVM_EXIT_IO (2) [005] 538597.212543: kvm_fpu: load [005] 538597.212544: kvm_entry: vcpu 0 [005] 538597.212547: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a9c info 60640001 0 [005] 538597.212548: kvm_nested_vmexit: rip fffff80e02ac6a9c reason IO_INSTRUCTION info1 60640001 info2 0 int_info 0 int_info_err 0 [005] 538597.212548: kvm_pio: pio_write at 0x6064 size 2 count 1 val 0x0 [005] 538597.212549: kvm_fpu: unload [005] 538597.212550: kvm_userspace_exit: reason KVM_EXIT_IO (2) and this likely lowers the line. I honestly have no idea how this all works on real hw but the comment in kernel ioapic says something about non-immediate delivery of the reasserted interrupt. True or not, this gives me some peace of mind :-)
> On 1 Apr 2019, at 20:28, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: > > Liran Alon <liran.alon@oracle.com> writes: > >>> On 1 Apr 2019, at 18:58, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: >>> >>> Liran Alon <liran.alon@oracle.com> writes: >>> >>>>> On 1 Apr 2019, at 16:36, Vitaly Kuznetsov <vkuznets@redhat.com> wrote: >>>>> >>>>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + >>>>> piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that >>>>> it is mishandling level-triggered interrupt performing EOI without fixing >>>>> the root cause. >>>> >>>> I would rephrase as: >>>> It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. >>>> Root-cause was that one of Hyper-V level-triggered interrupt handler performs EOI before fixing the root-cause. >>>> This results in IOAPIC keep re-raising the level-triggered interrupt >>>> after EOI because irq-line remains asserted. >>> >>> Ok, thanks for the suggestion. >>> >>>> >>>>> This causes immediate re-assertion and L2 VM (which is >>>>> supposedly expected to fix the cause of the interrupt) is not making any >>>>> progress. >>>> >>>> I don’t know why you assume this. >>>> From the trace we have examined, it seems that the EOI is performed by Hyper-V and not it’s guest >>>> This means that the handler for this level-triggered interrupt is on >>>> Hyper-V and not it’s guest. >>> >>> If you let it run (with e.g. this patch or by setting preemtion timer > >>> 0) you'll see that MMIO write fixing the cause of the interrupt is >>> happening from L2: >>> >>> (qemu) info pci: >>> >>> Bus 0, device 4, function 0: >>> USB controller: PCI device 8086:7112 >>> PCI subsystem 1af4:1100 >>> IRQ 23. >>> BAR4: I/O at 0x6060 [0x607f]. >>> id "" >>> >>> ... >>> 538597.212494: kvm_exit: reason VMRESUME rip 0xfffff80004250115 info 0 0 >>> 538597.212499: kvm_entry: vcpu 0 >>> 538597.212506: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a27 info 60620009 0 >>> 538597.212507: kvm_nested_vmexit: rip fffff80e02ac6a27 reason IO_INSTRUCTION info1 60620009 info2 0 int_info 0 int_info_err 0 >>> 538597.212509: kvm_fpu: unload >>> 538597.212511: kvm_userspace_exit: reason KVM_EXIT_IO (2) >>> 538597.212516: kvm_fpu: load >>> 538597.212518: kvm_pio: pio_read at 0x6062 size 2 count 1 val 0x1 >>> 538597.212519: kvm_entry: vcpu 0 >>> 538597.212523: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a61 info 60640009 0 >>> 538597.212523: kvm_nested_vmexit: rip fffff80e02ac6a61 reason IO_INSTRUCTION info1 60640009 info2 0 int_info 0 int_info_err 0 >>> 538597.212524: kvm_fpu: unload >>> 538597.212525: kvm_userspace_exit: reason KVM_EXIT_IO (2) >>> 538597.212528: kvm_fpu: load >>> 538597.212528: kvm_pio: pio_read at 0x6064 size 2 count 1 val 0xf >>> ... >>> >>> and this happens after EOI from L1. >> >> I see that the L2 guest is doing I/O read to the device BAR4 but do these reads lower the irq-line? >> I would expect a write to lower the irq-line. >> >> Looking at uhci_port_read(), it seems that offset 0x02 and 0x04 just return a value. Doesn’t lower irq-line. >> (Even though offset 0x04 returns the "interrupt enable register”). >> In contrast, looking at uhci_port_write(), it seems that writing to either offset 0x02 or 0x04 could lower the irq-line. >> So you should look for pio_write to port 0x6062 or 0x6064 to see who >> is actually responsible for lowering the irq-line. > > Sorry, I probably like trimming traces too much. Writes happen too: > > [005] 538597.212532: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a8f info 60620001 0 > [005] 538597.212533: kvm_nested_vmexit: rip fffff80e02ac6a8f reason IO_INSTRUCTION info1 60620001 info2 0 int_info 0 int_info_err 0 > [005] 538597.212534: kvm_pio: pio_write at 0x6062 size 2 count 1 val 0x1 This clears bit UHCI_STS_USBINT from “status” register and zero “status2” register. > [005] 538597.212534: kvm_fpu: unload > [005] 538597.212535: kvm_userspace_exit: reason KVM_EXIT_IO (2) > [005] 538597.212543: kvm_fpu: load > [005] 538597.212544: kvm_entry: vcpu 0 > [005] 538597.212547: kvm_exit: reason IO_INSTRUCTION rip 0xfffff80e02ac6a9c info 60640001 0 > [005] 538597.212548: kvm_nested_vmexit: rip fffff80e02ac6a9c reason IO_INSTRUCTION info1 60640001 info2 0 int_info 0 int_info_err 0 > [005] 538597.212548: kvm_pio: pio_write at 0x6064 size 2 count 1 val 0x0 This sets 0 in the “intr” register (Interrupt enable register). At this point it is indeed likely that uhci_update_irq() will lower the irq-line. I agree. :) > [005] 538597.212549: kvm_fpu: unload > [005] 538597.212550: kvm_userspace_exit: reason KVM_EXIT_IO (2) > > and this likely lowers the line. OK I was convinced that L2 does the irq-line lowering. > I honestly have no idea how this all > works on real hw but the comment in kernel ioapic says something about > non-immediate delivery of the reasserted interrupt. True or not, this > gives me some peace of mind :-) Yeah I know. It’s just weird that this non-immediate delivery translates to “that you can even resume into VMX non-root-mode and run a bunch of logic that lower irq-line” before interrupt is re-raised. That’s kinda crazy :) It’s more likely that this also reproduce on real hardware and that Microsoft maybe haven’t checked with this hardware? Anyway, we are on the same page here. :) -Liran > > -- > Vitaly
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c index 9d75f84d3b..daf45cc8a8 100644 --- a/hw/intc/ioapic.c +++ b/hw/intc/ioapic.c @@ -139,6 +139,15 @@ static void ioapic_service(IOAPICCommonState *s) } } +#define SUCCESSIVE_IRQ_MAX_COUNT 10000 + +static void ioapic_timer(void *opaque) +{ + IOAPICCommonState *s = opaque; + + ioapic_service(s); +} + static void ioapic_set_irq(void *opaque, int vector, int level) { IOAPICCommonState *s = opaque; @@ -227,7 +236,28 @@ void ioapic_eoi_broadcast(int vector) trace_ioapic_clear_remote_irr(n, vector); s->ioredtbl[n] = entry & ~IOAPIC_LVT_REMOTE_IRR; if (!(entry & IOAPIC_LVT_MASKED) && (s->irr & (1 << n))) { - ioapic_service(s); + bool level = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) + == IOAPIC_TRIGGER_LEVEL; + + ++s->irq_reassert[vector]; + if (!level || + s->irq_reassert[vector] < SUCCESSIVE_IRQ_MAX_COUNT) { + ioapic_service(s); + } else { + /* + * Real hardware does not deliver the interrupt + * immediately during eoi broadcast, and this lets a + * buggy guest make slow progress even if it does not + * correctly handle a level-triggered interrupt. Emulate + * this behavior if we detect an interrupt storm. + */ + trace_ioapic_eoi_delayed_reassert(vector); + timer_mod(s->timer, + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + + NANOSECONDS_PER_SECOND / 100); + } + } else { + s->irq_reassert[vector] = 0; } } } @@ -401,6 +431,8 @@ static void ioapic_realize(DeviceState *dev, Error **errp) memory_region_init_io(&s->io_memory, OBJECT(s), &ioapic_io_ops, s, "ioapic", 0x1000); + s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, ioapic_timer, s); + qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS); ioapics[ioapic_no] = s; @@ -408,6 +440,14 @@ static void ioapic_realize(DeviceState *dev, Error **errp) qemu_add_machine_init_done_notifier(&s->machine_done); } +static void ioapic_unrealize(DeviceState *dev, Error **errp) +{ + IOAPICCommonState *s = IOAPIC_COMMON(dev); + + timer_del(s->timer); + timer_free(s->timer); +} + static Property ioapic_properties[] = { DEFINE_PROP_UINT8("version", IOAPICCommonState, version, IOAPIC_VER_DEF), DEFINE_PROP_END_OF_LIST(), @@ -419,6 +459,7 @@ static void ioapic_class_init(ObjectClass *klass, void *data) DeviceClass *dc = DEVICE_CLASS(klass); k->realize = ioapic_realize; + k->unrealize = ioapic_unrealize; /* * If APIC is in kernel, we need to update the kernel cache after * migration, otherwise first 24 gsi routes will be invalid. diff --git a/hw/intc/trace-events b/hw/intc/trace-events index a28bdce925..90c9d07c1a 100644 --- a/hw/intc/trace-events +++ b/hw/intc/trace-events @@ -25,6 +25,7 @@ apic_mem_writel(uint64_t addr, uint32_t val) "0x%"PRIx64" = 0x%08x" ioapic_set_remote_irr(int n) "set remote irr for pin %d" ioapic_clear_remote_irr(int n, int vector) "clear remote irr for pin %d vector %d" ioapic_eoi_broadcast(int vector) "EOI broadcast for vector %d" +ioapic_eoi_delayed_reassert(int vector) "delayed reassert on EOI broadcast for vector %d" ioapic_mem_read(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem read addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" retval 0x%"PRIx32 ioapic_mem_write(uint8_t addr, uint8_t regsel, uint8_t size, uint32_t val) "ioapic mem write addr 0x%"PRIx8" regsel: 0x%"PRIx8" size 0x%"PRIx8" val 0x%"PRIx32 ioapic_set_irq(int vector, int level) "vector: %d level: %d" diff --git a/include/hw/i386/ioapic_internal.h b/include/hw/i386/ioapic_internal.h index 9848f391bb..e0ee88db40 100644 --- a/include/hw/i386/ioapic_internal.h +++ b/include/hw/i386/ioapic_internal.h @@ -96,6 +96,7 @@ typedef struct IOAPICCommonClass { SysBusDeviceClass parent_class; DeviceRealize realize; + DeviceUnrealize unrealize; void (*pre_save)(IOAPICCommonState *s); void (*post_load)(IOAPICCommonState *s); } IOAPICCommonClass; @@ -111,6 +112,8 @@ struct IOAPICCommonState { uint8_t version; uint64_t irq_count[IOAPIC_NUM_PINS]; int irq_level[IOAPIC_NUM_PINS]; + int irq_reassert[IOAPIC_NUM_PINS]; + QEMUTimer *timer; }; void ioapic_reset_common(DeviceState *dev);
It was found that Hyper-V 2016 on KVM in some configurations (q35 machine + piix4-usb-uhci) hangs on boot. Trace analysis led us to the conclusion that it is mishandling level-triggered interrupt performing EOI without fixing the root cause. This causes immediate re-assertion and L2 VM (which is supposedly expected to fix the cause of the interrupt) is not making any progress. Gory details: https://www.spinics.net/lists/kvm/msg184484.html Turns out we were dealing with similar issues before; in-kernel IOAPIC implementation has commit 184564efae4d ("kvm: ioapic: conditionally delay irq delivery duringeoi broadcast") which describes a very similar issue. Steal the idea from the above mentioned commit for IOAPIC implementation in QEMU. SUCCESSIVE_IRQ_MAX_COUNT, delay and the comment are borrowed as well. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> --- hw/intc/ioapic.c | 43 ++++++++++++++++++++++++++++++- hw/intc/trace-events | 1 + include/hw/i386/ioapic_internal.h | 3 +++ 3 files changed, 46 insertions(+), 1 deletion(-)