Message ID | 1556601559-30921-1-git-send-email-chao.gao@intel.com (mailing list archive)
---|---
State | New, archived
Series | x86/pt: skip setup of posted format IRTE when gvec is 0
>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> When testing with an UP guest with a pass-thru device with vt-d pi
> enabled in host, we observed that guest couldn't receive interrupts
> from that pass-thru device. Dumping IRTE, we found the corresponding
> IRTE is set to posted format with "vector" field as 0.
>
> We would fall into this issue when guest used the pirq format of MSI
> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> is repurposed, skip migration which is based on 'dest_id'.

I've gone through all uses of gvec, and I couldn't find any existing
special casing of it being zero. I assume this is actually communication
between the kernel and qemu, in which case I'd like to see an
explanation of why the issue needs to be addressed in Xen rather than
qemu. Otherwise, if I've overlooked something, would you mind pointing
out where such special casing lives in Xen?

In any event it doesn't look correct to skip migration altogether in
that case. I'd rather expect it to require getting done differently.
After all there still is a (CPU, vector) tuple associated with that
{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
posted.

Finally it hasn't become clear to me why this is a UP-guest issue only.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -413,34 +413,52 @@ int pt_irq_create_bind(
>                  pirq_dpci->gmsi.gflags = gflags;
>              }
>          }
> -        /* Calculate dest_vcpu_id for MSI-type pirq migration. */
> -        dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
> -                         XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
> -        dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
> -        delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
> -                                  XEN_DOMCTL_VMSI_X86_DELIV_MASK);
> -
> -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> -        pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> -        spin_unlock(&d->event_lock);
> -
> -        pirq_dpci->gmsi.posted = false;
> -        vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
> -        if ( iommu_intpost )
> +        /*
> +         * Migrate pirq and create posted format IRTE only if we know the gmsi's
> +         * dest_id and vector.
> +         */
> +        if ( pirq_dpci->gmsi.gvec )

If we're to go with this hypervisor side change, please insert a
blank line above the comment.

Jan
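A note on the "pirq format of MSI" under discussion: the guest composes an
MSI message whose vector field is 0 and whose destination-ID bits carry a
pirq number rather than an APIC ID. The snippet below is an illustrative
sketch only; compose_pirq_msi() and its constants are invented for this
example (the real logic is xen_msi_compose_msg() in the Linux kernel, which
also copes with pirqs that do not fit in 8 bits):

    #include <stdint.h>

    struct msi_msg {
        uint32_t address_lo;
        uint32_t address_hi;
        uint32_t data;
    };

    /*
     * Illustrative sketch, not the actual Linux code: build a pirq-format
     * MSI message.  The dest_id bits (address bits 19:12) are repurposed
     * to carry the pirq, and the vector field in the data register stays
     * 0, which is exactly the gvec == 0 case discussed in this thread.
     */
    static void compose_pirq_msi(unsigned int pirq, struct msi_msg *msg)
    {
        msg->address_hi = 0;
        msg->address_lo = 0xfee00000u              /* standard MSI address base */
                          | ((pirq & 0xff) << 12); /* dest_id carries the pirq */
        msg->data = 0;                             /* fixed delivery, vector == 0 */
    }

Because the vector field is 0 and dest_id no longer names a vCPU, any Xen
logic keyed off those fields (IRTE posting, pirq migration) sees garbage.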
On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
>> When testing with an UP guest with a pass-thru device with vt-d pi
>> enabled in host, we observed that guest couldn't receive interrupts
>> from that pass-thru device. Dumping IRTE, we found the corresponding
>> IRTE is set to posted format with "vector" field as 0.
>>
>> We would fall into this issue when guest used the pirq format of MSI
>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> is repurposed, skip migration which is based on 'dest_id'.
>
>I've gone through all uses of gvec, and I couldn't find any existing
>special casing of it being zero. I assume this is actually communication
>between the kernel and qemu,

Yes.

>in which case I'd like to see an
>explanation of why the issue needs to be addressed in Xen rather
>than qemu.

To call pirq_guest_bind() to configure irq_desc properly.
Especially, we append a pointer of struct domain to 'action->guest' in
pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
in this interrupt and injects an interrupt to those domains.

>Otherwise, if I've overlooked something, would you
>mind pointing out where such special casing lives in Xen?
>
>In any event it doesn't look correct to skip migration altogether in
>that case. I'd rather expect it to require getting done differently.
>After all there still is a (CPU, vector) tuple associated with that
>{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>posted.

Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
is running on to reduce IPI. But the 'dest_id' field which used to
indicate the vmsi's target vcpu is missing, we don't know which cpu we should
migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
used in send_guest_pirq(). Do you think this choice is fine?

>
>Finally it hasn't become clear to me why this is a UP-guest issue
>only.

For SMP case, it happens to work. hvm_girq_dest_2_vcpu_id() would return
-1 for SMP in most cases. And then we won't use VT-d PI if there is no
dest vcpu. For UP case, hvm_girq_dest_2_vcpu_id() returns 0 without
matching.

>
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -413,34 +413,52 @@ int pt_irq_create_bind(
>>                  pirq_dpci->gmsi.gflags = gflags;
>>              }
>>          }
>> -        /* Calculate dest_vcpu_id for MSI-type pirq migration. */
>> -        dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
>> -                         XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
>> -        dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
>> -        delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
>> -                                  XEN_DOMCTL_VMSI_X86_DELIV_MASK);
>> -
>> -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>> -        pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>> -        spin_unlock(&d->event_lock);
>> -
>> -        pirq_dpci->gmsi.posted = false;
>> -        vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
>> -        if ( iommu_intpost )
>> +        /*
>> +         * Migrate pirq and create posted format IRTE only if we know the gmsi's
>> +         * dest_id and vector.
>> +         */
>> +        if ( pirq_dpci->gmsi.gvec )
>
>If we're to go with this hypervisor side change, please insert a
>blank line above the comment.

Will do

Thanks
Chao
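To make the pirq_guest_bind()/__do_IRQ_guest() relationship above concrete,
here is a self-contained conceptual model. It is a simplified sketch with
invented names, not the real Xen code (which also deals with sharing flags,
ACK types and locking):

    #include <stddef.h>

    #define IRQ_GUEST_MAX 7

    struct domain;                      /* opaque for this sketch */

    /* Per-IRQ record of the domains that asked to receive this interrupt. */
    struct irq_guest_action {
        unsigned int   nr_guests;
        struct domain *guest[IRQ_GUEST_MAX];
    };

    struct irq_desc {
        struct irq_guest_action *action;
    };

    /* Stub standing in for "inject a vIRQ / send an event to the domain". */
    static void send_pirq_to_domain(struct domain *d, int pirq)
    {
        (void)d;
        (void)pirq;
    }

    /* Roughly what pirq_guest_bind() contributes: remember the domain. */
    static int guest_bind(struct irq_desc *desc, struct domain *d)
    {
        struct irq_guest_action *act = desc->action;

        if ( act->nr_guests >= IRQ_GUEST_MAX )
            return -1;                  /* cannot share any further */
        act->guest[act->nr_guests++] = d;
        return 0;
    }

    /* Roughly what __do_IRQ_guest() does when the physical IRQ fires. */
    static void do_irq_guest(const struct irq_desc *desc, int pirq)
    {
        for ( unsigned int i = 0; i < desc->action->nr_guests; i++ )
            send_pirq_to_domain(desc->action->guest[i], pirq);
    }

Without the bind step, the dispatch loop never learns about the domain,
which is why the hypercall cannot simply be dropped on the QEMU side.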
On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
> When testing with an UP guest with a pass-thru device with vt-d pi
> enabled in host, we observed that guest couldn't receive interrupts
> from that pass-thru device. Dumping IRTE, we found the corresponding
> IRTE is set to posted format with "vector" field as 0.
>
> We would fall into this issue when guest used the pirq format of MSI

I think the above sentence is a bit confusing. I would rather say that
the guest is routing interrupts from passthrough devices over event
channels using pirqs.

> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> is repurposed, skip migration which is based on 'dest_id'.

Like Jan, I wonder why the device model (I assume QEMU) issues a
XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
these interrupts over an event channel; attempting to bind it to a
vector seems like a bug in the device model.

I would also consider whether it would make sense to not expose the
XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
are supported; performance-wise it's better to use posted interrupts
rather than routing them over event channels (which requires Xen
interaction).

Note that I think this whole XENFEAT_hvm_pirqs is a big hack which
abuses a hardware interface, and I would like to see it removed.

Thanks, Roger.
On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> >> When testing with an UP guest with a pass-thru device with vt-d pi
> >> enabled in host, we observed that guest couldn't receive interrupts
> >> from that pass-thru device. Dumping IRTE, we found the corresponding
> >> IRTE is set to posted format with "vector" field as 0.
> >>
> >> We would fall into this issue when guest used the pirq format of MSI
> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >> is repurposed, skip migration which is based on 'dest_id'.
> >
> >I've gone through all uses of gvec, and I couldn't find any existing
> >special casing of it being zero. I assume this is actually communication
> >between the kernel and qemu,
>
> Yes.
>
> >in which case I'd like to see an
> >explanation of why the issue needs to be addressed in Xen rather
> >than qemu.
>
> To call pirq_guest_bind() to configure irq_desc properly.
> Especially, we append a pointer of struct domain to 'action->guest' in
> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
> in this interrupt and injects an interrupt to those domains.
>
> >Otherwise, if I've overlooked something, would you
> >mind pointing out where such special casing lives in Xen?
> >
> >In any event it doesn't look correct to skip migration altogether in
> >that case. I'd rather expect it to require getting done differently.
> >After all there still is a (CPU, vector) tuple associated with that
> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
> >posted.
>
> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> is running on to reduce IPI. But the 'dest_id' field which used to
> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> used in send_guest_pirq(). Do you think this choice is fine?

I think that by the time the device model calls into pirq_guest_bind
the PIRQ won't be bound to any event channel, so pirq->evtchn would be
0.

Note that the binding of the PIRQ with the event channel is done
afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.

It seems like the device model should be using a different set of
hypercalls to setup a PIRQ that is routed over an event channel, ie:
PHYSDEVOP_map_pirq and friends.

Thanks, Roger.
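For context, the guest-side step Roger refers to (binding the already-mapped
pirq to an event channel) looks roughly like the following. This is a hedged
sketch modelled on Xen's public event-channel interface; the struct layout is
abridged and the hypercall stub is assumed to be provided by the guest
kernel:

    #include <stdint.h>

    /* Abridged from Xen's public event_channel.h for illustration. */
    #define EVTCHNOP_bind_pirq     2
    #define BIND_PIRQ__WILL_SHARE  1

    typedef uint32_t evtchn_port_t;

    struct evtchn_bind_pirq {
        uint32_t pirq;         /* IN: the pirq the device model mapped earlier */
        uint32_t flags;        /* IN: BIND_PIRQ__WILL_SHARE */
        evtchn_port_t port;    /* OUT: the resulting event channel port */
    };

    /* Assumed to exist in the guest kernel's hypercall layer. */
    extern long HYPERVISOR_event_channel_op(int cmd, void *arg);

    /* Bind an already-mapped pirq to an event channel; returns port or < 0. */
    static long bind_pirq_to_evtchn(uint32_t pirq)
    {
        struct evtchn_bind_pirq bind = {
            .pirq  = pirq,
            .flags = BIND_PIRQ__WILL_SHARE,
        };
        long rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind);

        return rc ? rc : (long)bind.port;
    }

Since this call only happens later, from xen_hvm_setup_msi_irqs, pirq->evtchn
is indeed still 0 at the time XEN_DOMCTL_bind_pt_irq runs.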
>>> On 30.04.19 at 11:01, <chao.gao@intel.com> wrote:
> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>>In any event it doesn't look correct to skip migration altogether in
>>that case. I'd rather expect it to require getting done differently.
>>After all there still is a (CPU, vector) tuple associated with that
>>{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>>posted.
>
> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> is running on to reduce IPI. But the 'dest_id' field which used to
> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> used in send_guest_pirq(). Do you think this choice is fine?

Well, the code you change is the one binding the IRQ, not the one
delivering it. Is the event channel set up already at all at that
point? See also Roger's reply.

>>Finally it hasn't become clear to me why this is a UP-guest issue
>>only.
>
> For SMP case, it happens to work. hvm_girq_dest_2_vcpu_id() would return
> -1 for SMP in most cases. And then we won't use VT-d PI if there is no
> dest vcpu. For UP case, hvm_girq_dest_2_vcpu_id() returns 0 without
> matching.

"Happens to work" together with how hvm_girq_dest_2_vcpu_id() works
looks to mean "happens to sometimes work" (maybe "often", but hardly
"always"): The function easily can return other than -1 for multi-CPU
guests as well. Which then is another reason to consider fixing the
issue in qemu instead.

Jan
On Tue, Apr 30, 2019 at 11:08:54AM +0200, Roger Pau Monné wrote:
>On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
>> When testing with an UP guest with a pass-thru device with vt-d pi
>> enabled in host, we observed that guest couldn't receive interrupts
>> from that pass-thru device. Dumping IRTE, we found the corresponding
>> IRTE is set to posted format with "vector" field as 0.
>>
>> We would fall into this issue when guest used the pirq format of MSI
>
>I think the above sentence is a bit confusing. I would rather say that
>the guest is routing interrupts from passthrough devices over event
>channels using pirqs.

Yes. It is better than it was.

>
>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> is repurposed, skip migration which is based on 'dest_id'.
>
>Like Jan, I wonder why the device model (I assume QEMU) issues a
>XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
>these interrupts over an event channel; attempting to bind it to a
>vector seems like a bug in the device model.
>
>I would also consider whether it would make sense to not expose the
>XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
>are supported; performance-wise it's better to use posted interrupts
>rather than routing them over event channels (which requires Xen
>interaction).

It is reasonable. But I think currently a guest can add "xen_nopv" to
its kernel boot options to avoid using evtchns. There might be some
cases even with pass-thru devices (such as when the guest uses a
polling mode driver for pass-thru devices) where we care more about the
efficiency of delivering virtual interrupts. So a separate option for
evtchns in the guest kernel seems like a better choice.

Thanks
Chao

>
>Note that I think this whole XENFEAT_hvm_pirqs is a big hack which
>abuses a hardware interface, and I would like to see it removed.
>
>Thanks, Roger.
On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
>On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
>> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
>> >> When testing with an UP guest with a pass-thru device with vt-d pi
>> >> enabled in host, we observed that guest couldn't receive interrupts
>> >> from that pass-thru device. Dumping IRTE, we found the corresponding
>> >> IRTE is set to posted format with "vector" field as 0.
>> >>
>> >> We would fall into this issue when guest used the pirq format of MSI
>> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> >> is repurposed, skip migration which is based on 'dest_id'.
>> >
>> >I've gone through all uses of gvec, and I couldn't find any existing
>> >special casing of it being zero. I assume this is actually communication
>> >between the kernel and qemu,
>>
>> Yes.
>>
>> >in which case I'd like to see an
>> >explanation of why the issue needs to be addressed in Xen rather
>> >than qemu.
>>
>> To call pirq_guest_bind() to configure irq_desc properly.
>> Especially, we append a pointer of struct domain to 'action->guest' in
>> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
>> in this interrupt and injects an interrupt to those domains.
>>
>> >Otherwise, if I've overlooked something, would you
>> >mind pointing out where such special casing lives in Xen?
>> >
>> >In any event it doesn't look correct to skip migration altogether in
>> >that case. I'd rather expect it to require getting done differently.
>> >After all there still is a (CPU, vector) tuple associated with that
>> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>> >posted.
>>
>> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
>> is running on to reduce IPI. But the 'dest_id' field which used to
>> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
>> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
>> used in send_guest_pirq(). Do you think this choice is fine?
>
>I think that by the time the device model calls into pirq_guest_bind
>the PIRQ won't be bound to any event channel, so pirq->evtchn would be
>0.

Then skipping pirq migration is the only choice here? And we can migrate
the pirq when it is bound to an event channel.

>
>Note that the binding of the PIRQ with the event channel is done
>afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
>
>It seems like the device model should be using a different set of
>hypercalls to setup a PIRQ that is routed over an event channel, ie:
>PHYSDEVOP_map_pirq and friends.

Now qemu is using PHYSDEVOP_map_pirq. Right?

Thanks
Chao
On 30/04/2019 17:24, Chao Gao wrote:
> On Tue, Apr 30, 2019 at 11:08:54AM +0200, Roger Pau Monné wrote:
>> On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
>>> When testing with an UP guest with a pass-thru device with vt-d pi
>>> enabled in host, we observed that guest couldn't receive interrupts
>>> from that pass-thru device. Dumping IRTE, we found the corresponding
>>> IRTE is set to posted format with "vector" field as 0.
>>>
>>> We would fall into this issue when guest used the pirq format of MSI
>> I think the above sentence is a bit confusing. I would rather say that
>> the guest is routing interrupts from passthrough devices over event
>> channels using pirqs.
> Yes. It is better than it was.
>
>>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>>> is repurposed, skip migration which is based on 'dest_id'.
>> Like Jan, I wonder why the device model (I assume QEMU) issues a
>> XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
>> these interrupts over an event channel; attempting to bind it to a
>> vector seems like a bug in the device model.
>>
>> I would also consider whether it would make sense to not expose the
>> XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
>> are supported; performance-wise it's better to use posted interrupts
>> rather than routing them over event channels (which requires Xen
>> interaction).
> It is reasonable. But I think currently a guest can add "xen_nopv" to
> its kernel boot options to avoid using evtchns. There might be some
> cases even with pass-thru devices (such as when the guest uses a
> polling mode driver for pass-thru devices) where we care more about the
> efficiency of delivering virtual interrupts. So a separate option for
> evtchns in the guest kernel seems like a better choice.

This is a festering swamp, and is going to need some careful
architectural work to untangle.

At the moment, guests which are xen-aware will try to use event channels
for everything. I'm pretty sure we've got some ABI incompatibilities
here with hardware APIC acceleration, and I still have yet(!) to
diagnose the underlying problem which is preventing XenServer from
enabling hardware APIC acceleration. (VMs still intermittently wedge on
migrate with what appears to be a lost interrupt.)

The legacy HVM callback via single vector is incompatible with hardware
APIC acceleration, because it exists specifically to skip going through
the IRR. I can't think of a proper solution to this, other than crashing
the guest when it tries to set such a situation up.

From an "architecturally ideal" point of view, we obviously want to be
using APICV and/or Posted interrupt support whenever possible, and be
using these methods in preference to event channels.

Injecting interrupts (including signalling an event upcall) using the
PIR is more efficient than other means, so long as we can ensure that
the guest will handle the vector like any other edge triggered
interrupt, and ack it at the LAPIC. (This is where the legacy 'callback
via' method breaks down, as it was intentionally designed to be used
without an ack at the LAPIC.)

As soon as we get to device interrupts, posted is definitely the way to
go, especially as the guest should not be aware of the routing behind
the scenes in the first place.

To be clear, Chao - I'm not asking you to fix this. This is a decade of
technical debt which has been steadily growing, and it needs to start
with a coherent plan for what the current state of play is, and how
we'd like to proceed.

~Andrew
On Wed, May 01, 2019 at 12:41:13AM +0800, Chao Gao wrote:
> On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
> >On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
> >> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
> >> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> >> >> When testing with an UP guest with a pass-thru device with vt-d pi
> >> >> enabled in host, we observed that guest couldn't receive interrupts
> >> >> from that pass-thru device. Dumping IRTE, we found the corresponding
> >> >> IRTE is set to posted format with "vector" field as 0.
> >> >>
> >> >> We would fall into this issue when guest used the pirq format of MSI
> >> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >> >> is repurposed, skip migration which is based on 'dest_id'.
> >> >
> >> >I've gone through all uses of gvec, and I couldn't find any existing
> >> >special casing of it being zero. I assume this is actually communication
> >> >between the kernel and qemu,
> >>
> >> Yes.
> >>
> >> >in which case I'd like to see an
> >> >explanation of why the issue needs to be addressed in Xen rather
> >> >than qemu.
> >>
> >> To call pirq_guest_bind() to configure irq_desc properly.
> >> Especially, we append a pointer of struct domain to 'action->guest' in
> >> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
> >> in this interrupt and injects an interrupt to those domains.
> >>
> >> >Otherwise, if I've overlooked something, would you
> >> >mind pointing out where such special casing lives in Xen?
> >> >
> >> >In any event it doesn't look correct to skip migration altogether in
> >> >that case. I'd rather expect it to require getting done differently.
> >> >After all there still is a (CPU, vector) tuple associated with that
> >> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
> >> >posted.
> >>
> >> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> >> is running on to reduce IPI. But the 'dest_id' field which used to
> >> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> >> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> >> used in send_guest_pirq(). Do you think this choice is fine?
> >
> >I think that by the time the device model calls into pirq_guest_bind
> >the PIRQ won't be bound to any event channel, so pirq->evtchn would be
> >0.
>
> Then skipping pirq migration is the only choice here? And we can migrate
> the pirq when it is bound to an event channel.
>
> >
> >Note that the binding of the PIRQ with the event channel is done
> >afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
> >
> >It seems like the device model should be using a different set of
> >hypercalls to setup a PIRQ that is routed over an event channel, ie:
> >PHYSDEVOP_map_pirq and friends.
>
> Now qemu is using PHYSDEVOP_map_pirq. Right?

Oh yes, QEMU already uses PHYSDEVOP_map_pirq to setup the interrupt.
Then I'm not sure I see why QEMU calls XEN_DOMCTL_bind_pt_irq for
interrupts that are routed over event channels. That hypercall is used
to bind a pirq to a native guest interrupt injection mechanism, which
shouldn't be used if the interrupt is going to be delivered over an
event channel.

Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
the interrupt is going to be routed over an event channel? That would
avoid having to add more quirks to the hypercall implementation.

Thanks, Roger.
On Tue, Apr 30, 2019 at 05:48:14PM +0100, Andrew Cooper wrote:
> On 30/04/2019 17:24, Chao Gao wrote:
> > On Tue, Apr 30, 2019 at 11:08:54AM +0200, Roger Pau Monné wrote:
> >> On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
> >>> When testing with an UP guest with a pass-thru device with vt-d pi
> >>> enabled in host, we observed that guest couldn't receive interrupts
> >>> from that pass-thru device. Dumping IRTE, we found the corresponding
> >>> IRTE is set to posted format with "vector" field as 0.
> >>>
> >>> We would fall into this issue when guest used the pirq format of MSI
> >> I think the above sentence is a bit confusing. I would rather say that
> >> the guest is routing interrupts from passthrough devices over event
> >> channels using pirqs.
> > Yes. It is better than it was.
> >
> >>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >>> is repurposed, skip migration which is based on 'dest_id'.
> >> Like Jan, I wonder why the device model (I assume QEMU) issues a
> >> XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
> >> these interrupts over an event channel; attempting to bind it to a
> >> vector seems like a bug in the device model.
> >>
> >> I would also consider whether it would make sense to not expose the
> >> XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
> >> are supported; performance-wise it's better to use posted interrupts
> >> rather than routing them over event channels (which requires Xen
> >> interaction).
> > It is reasonable. But I think currently a guest can add "xen_nopv" to
> > its kernel boot options to avoid using evtchns. There might be some
> > cases even with pass-thru devices (such as when the guest uses a
> > polling mode driver for pass-thru devices) where we care more about the
> > efficiency of delivering virtual interrupts. So a separate option for
> > evtchns in the guest kernel seems like a better choice.
>
> This is a festering swamp, and is going to need some careful
> architectural work to untangle.
>
> At the moment, guests which are xen-aware will try to use event channels
> for everything. I'm pretty sure we've got some ABI incompatibilities
> here with hardware APIC acceleration, and I still have yet(!) to
> diagnose the underlying problem which is preventing XenServer from
> enabling hardware APIC acceleration. (VMs still intermittently wedge on
> migrate with what appears to be a lost interrupt.)

My opinion is that XENFEAT_hvm_pirqs should be removed:

 - It abuses a hardware interface to pass Xen specific information.
   When routing pirqs over event channels for HVM the pirq is passed on
   the MSI address field, hijacking the native dest_id field.
 - Cannot be used with posted interrupts or APICV.

Iff routing physical interrupts over event channels is still a useful
thing for HVM, it should be done using hypercalls, like it's done on a
PV dom0.

> The legacy HVM callback via single vector is incompatible with hardware
> APIC acceleration, because it exists specifically to skip going through
> the IRR. I can't think of a proper solution to this, other than crashing
> the guest when it tries to set such a situation up.
>
> From an "architecturally ideal" point of view, we obviously want to be
> using APICV and/or Posted interrupt support whenever possible, and be
> using these methods in preference to event channels.
>
> Injecting interrupts (including signalling an event upcall) using the
> PIR is more efficient than other means, so long as we can ensure that
> the guest will handle the vector like any other edge triggered
> interrupt, and ack it at the LAPIC. (This is where the legacy 'callback
> via' method breaks down, as it was intentionally designed to be used
> without an ack at the LAPIC.)

IIRC there's already a correct way to setup an event channel upcall
against a vector using HVMOP_set_evtchn_upcall_vector which requires an
APIC ack. Now the issue is switching existing users to this new
callback setup mechanism.

> As soon as we get to device interrupts, posted is definitely the way to
> go, especially as the guest should not be aware of the routing behind
> the scenes in the first place.
>
> To be clear, Chao - I'm not asking you to fix this. This is a decade of
> technical debt which has been steadily growing, and it needs to start
> with a coherent plan for what the current state of play is, and how
> we'd like to proceed.

I agree. I've suggested a possible fix for the problem at hand, but as
stated above I think XENFEAT_hvm_pirqs should be removed.

Thanks, Roger.
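For reference, the HVMOP_set_evtchn_upcall_vector mechanism Roger mentions is
set up along these lines. This is a minimal sketch based on my recollection
of Xen's public hvm_op.h; the opcode value and field names should be checked
against the actual headers, and the hypercall stub is assumed to be provided
by the guest kernel:

    #include <stdint.h>

    /* From Xen's public hvm/hvm_op.h, abridged; verify against the headers. */
    #define HVMOP_set_evtchn_upcall_vector 23

    struct xen_hvm_evtchn_upcall_vector {
        uint32_t vcpu;      /* vCPU to configure */
        uint8_t  vector;    /* IDT vector events are delivered on */
    };

    /* Assumed to exist in the guest kernel's hypercall layer. */
    extern long HYPERVISOR_hvm_op(int cmd, void *arg);

    /*
     * Register a per-vCPU event channel upcall vector.  Events delivered
     * this way look like ordinary edge triggered interrupts and require
     * an APIC EOI, which is what keeps them compatible with hardware APIC
     * acceleration, unlike the legacy "callback via" mechanism.
     */
    static long set_upcall_vector(uint32_t vcpu, uint8_t vector)
    {
        struct xen_hvm_evtchn_upcall_vector arg = {
            .vcpu   = vcpu,
            .vector = vector,
        };

        return HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &arg);
    }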
On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
>On Wed, May 01, 2019 at 12:41:13AM +0800, Chao Gao wrote:
>> On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
>> >On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
>> >> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>> >> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
>> >> >> When testing with an UP guest with a pass-thru device with vt-d pi
>> >> >> enabled in host, we observed that guest couldn't receive interrupts
>> >> >> from that pass-thru device. Dumping IRTE, we found the corresponding
>> >> >> IRTE is set to posted format with "vector" field as 0.
>> >> >>
>> >> >> We would fall into this issue when guest used the pirq format of MSI
>> >> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> >> >> is repurposed, skip migration which is based on 'dest_id'.
>> >> >
>> >> >I've gone through all uses of gvec, and I couldn't find any existing
>> >> >special casing of it being zero. I assume this is actually communication
>> >> >between the kernel and qemu,
>> >>
>> >> Yes.
>> >>
>> >> >in which case I'd like to see an
>> >> >explanation of why the issue needs to be addressed in Xen rather
>> >> >than qemu.
>> >>
>> >> To call pirq_guest_bind() to configure irq_desc properly.
>> >> Especially, we append a pointer of struct domain to 'action->guest' in
>> >> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
>> >> in this interrupt and injects an interrupt to those domains.
>> >>
>> >> >Otherwise, if I've overlooked something, would you
>> >> >mind pointing out where such special casing lives in Xen?
>> >> >
>> >> >In any event it doesn't look correct to skip migration altogether in
>> >> >that case. I'd rather expect it to require getting done differently.
>> >> >After all there still is a (CPU, vector) tuple associated with that
>> >> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>> >> >posted.
>> >>
>> >> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
>> >> is running on to reduce IPI. But the 'dest_id' field which used to
>> >> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
>> >> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
>> >> used in send_guest_pirq(). Do you think this choice is fine?
>> >
>> >I think that by the time the device model calls into pirq_guest_bind
>> >the PIRQ won't be bound to any event channel, so pirq->evtchn would be
>> >0.
>>
>> Then skipping pirq migration is the only choice here? And we can migrate
>> the pirq when it is bound to an event channel.
>>
>> >
>> >Note that the binding of the PIRQ with the event channel is done
>> >afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
>> >
>> >It seems like the device model should be using a different set of
>> >hypercalls to setup a PIRQ that is routed over an event channel, ie:
>> >PHYSDEVOP_map_pirq and friends.
>>
>> Now qemu is using PHYSDEVOP_map_pirq. Right?
>
>Oh yes, QEMU already uses PHYSDEVOP_map_pirq to setup the interrupt.
>Then I'm not sure I see why QEMU calls XEN_DOMCTL_bind_pt_irq for
>interrupts that are routed over event channels. That hypercall is used

As I said above, it is to call pirq_guest_bind() to hook up to the irq
handler.

XEN_DOMCTL_bind_pt_pirq does two things:
#1. bind pirq with a guest interrupt
#2. register (domain,pirq) to the interrupt handler

Currently, for a pirq routed to an evtchn, #1 is done by another
hypercall, evtchn_bind_pirq, and #2 is done in XEN_DOMCTL_bind_pt_irq.

>to bind a pirq to a native guest interrupt injection mechanism, which
>shouldn't be used if the interrupt is going to be delivered over an
>event channel.
>
>Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
>the interrupt is going to be routed over an event channel?

Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
to be compatible with old qemu.

I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
like the patch below:

diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index e86e2bf..0bcddb9 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
     if ( !info )
         ERROR_EXIT(-ENOMEM);
     info->evtchn = port;
-    rc = (!is_hvm_domain(d)
-          ? pirq_guest_bind(v, info,
-                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
-          : 0);
+    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));
     if ( rc != 0 )
     {
         info->evtchn = 0;
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4290c7c..5a0b83e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -346,6 +346,12 @@ int pt_irq_create_bind(
         uint32_t gflags = pt_irq_bind->u.msi.gflags &
                           ~XEN_DOMCTL_VMSI_X86_UNMASKED;
 
+        if ( !pt_irq_bind->u.msi.gvec )
+        {
+            spin_unlock(&d->event_lock);
+            return 0;
+        }
+
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         {
             pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED | HVM_IRQ_DPCI_MACH_MSI |

Thanks
Chao
>>> On 06.05.19 at 06:44, <chao.gao@intel.com> wrote:
> On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
>>Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
>>the interrupt is going to be routed over an event channel?
>
> Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
> to be compatible with old qemu.

That would be ugly indeed.

> I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
> like the patch below:

Is this meant as a replacement to your original patch, or as an
add-on? In any event it's not immediately clear to me how
...

> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
>      if ( !info )
>          ERROR_EXIT(-ENOMEM);
>      info->evtchn = port;
> -    rc = (!is_hvm_domain(d)
> -          ? pirq_guest_bind(v, info,
> -                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
> -          : 0);
> +    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));

... this becoming unconditional won't conflict with its other
invocation ...

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -346,6 +346,12 @@ int pt_irq_create_bind(
>          uint32_t gflags = pt_irq_bind->u.msi.gflags &
>                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
>
> +        if ( !pt_irq_bind->u.msi.gvec )
> +        {
> +            spin_unlock(&d->event_lock);
> +            return 0;
> +        }

... further down in this function, for the non-MSI case.
Similarly I wonder whether the respective unbind function
invocations then won't go (or already are?) out of sync.

Jan
On Mon, May 06, 2019 at 12:44:41PM +0800, Chao Gao wrote:
> On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
> >On Wed, May 01, 2019 at 12:41:13AM +0800, Chao Gao wrote:
> >> On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
> >> >On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
> >> >> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
> >> >> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> >> >> >> When testing with an UP guest with a pass-thru device with vt-d pi
> >> >> >> enabled in host, we observed that guest couldn't receive interrupts
> >> >> >> from that pass-thru device. Dumping IRTE, we found the corresponding
> >> >> >> IRTE is set to posted format with "vector" field as 0.
> >> >> >>
> >> >> >> We would fall into this issue when guest used the pirq format of MSI
> >> >> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >> >> >> is repurposed, skip migration which is based on 'dest_id'.
> >> >> >
> >> >> >I've gone through all uses of gvec, and I couldn't find any existing
> >> >> >special casing of it being zero. I assume this is actually communication
> >> >> >between the kernel and qemu,
> >> >>
> >> >> Yes.
> >> >>
> >> >> >in which case I'd like to see an
> >> >> >explanation of why the issue needs to be addressed in Xen rather
> >> >> >than qemu.
> >> >>
> >> >> To call pirq_guest_bind() to configure irq_desc properly.
> >> >> Especially, we append a pointer of struct domain to 'action->guest' in
> >> >> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
> >> >> in this interrupt and injects an interrupt to those domains.
> >> >>
> >> >> >Otherwise, if I've overlooked something, would you
> >> >> >mind pointing out where such special casing lives in Xen?
> >> >> >
> >> >> >In any event it doesn't look correct to skip migration altogether in
> >> >> >that case. I'd rather expect it to require getting done differently.
> >> >> >After all there still is a (CPU, vector) tuple associated with that
> >> >> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
> >> >> >posted.
> >> >>
> >> >> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> >> >> is running on to reduce IPI. But the 'dest_id' field which used to
> >> >> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> >> >> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> >> >> used in send_guest_pirq(). Do you think this choice is fine?
> >> >
> >> >I think that by the time the device model calls into pirq_guest_bind
> >> >the PIRQ won't be bound to any event channel, so pirq->evtchn would be
> >> >0.
> >>
> >> Then skipping pirq migration is the only choice here? And we can migrate
> >> the pirq when it is bound to an event channel.
> >>
> >> >
> >> >Note that the binding of the PIRQ with the event channel is done
> >> >afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
> >> >
> >> >It seems like the device model should be using a different set of
> >> >hypercalls to setup a PIRQ that is routed over an event channel, ie:
> >> >PHYSDEVOP_map_pirq and friends.
> >>
> >> Now qemu is using PHYSDEVOP_map_pirq. Right?
> >
> >Oh yes, QEMU already uses PHYSDEVOP_map_pirq to setup the interrupt.
> >Then I'm not sure I see why QEMU calls XEN_DOMCTL_bind_pt_irq for
> >interrupts that are routed over event channels. That hypercall is used
>
> As I said above, it is to call pirq_guest_bind() to hook up to the irq
> handler.
>
> XEN_DOMCTL_bind_pt_pirq does two things:
> #1. bind pirq with a guest interrupt
> #2. register (domain,pirq) to the interrupt handler
>
> Currently, for a pirq routed to an evtchn, #1 is done by another
> hypercall, evtchn_bind_pirq, and #2 is done in XEN_DOMCTL_bind_pt_irq.

So XEN_DOMCTL_bind_pt_irq basically does the pirq_guest_bind in this
case, and that's why the call to pirq_guest_bind is avoided in
evtchn_bind_pirq for HVM guests.

>
> >to bind a pirq to a native guest interrupt injection mechanism, which
> >shouldn't be used if the interrupt is going to be delivered over an
> >event channel.
> >
> >Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
> >the interrupt is going to be routed over an event channel?
>
> Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
> to be compatible with old qemu.

OK, leaving the XEN_DOMCTL_bind_pt_irq call in QEMU and making it a
no-op in this case seems like a good compromise solution IMO. It might
be helpful to add a comment to the QEMU code noting that the
XEN_DOMCTL_bind_pt_irq hypercall is not needed when routing pirqs over
event channels for HVM guests with Xen >= 4.13.

> I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
> like the patch below:
>
> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> index e86e2bf..0bcddb9 100644
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
>      if ( !info )
>          ERROR_EXIT(-ENOMEM);
>      info->evtchn = port;
> -    rc = (!is_hvm_domain(d)
> -          ? pirq_guest_bind(v, info,
> -                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
> -          : 0);
> +    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));
>      if ( rc != 0 )
>      {
>          info->evtchn = 0;
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 4290c7c..5a0b83e 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -346,6 +346,12 @@ int pt_irq_create_bind(
>          uint32_t gflags = pt_irq_bind->u.msi.gflags &
>                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
>
> +        if ( !pt_irq_bind->u.msi.gvec )
> +        {
> +            spin_unlock(&d->event_lock);
> +            return 0;
> +        }
> +

Can you see about short-circuiting pt_irq_create_bind much earlier?
Likely at the start of the function; AFAICT pirqs routed over event
channels don't need hvm_irq_dpci, so I think you can avoid allocating
it if all pirqs from a domain are routed over event channels.

Thanks, Roger.
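A rough illustration of the early exit Roger is asking for, given as an
untested sketch rather than a proposed patch. It assumes gvec == 0 remains
the marker for pirqs routed over event channels, and that the check sits
after the event_lock is taken but before hvm_irq_dpci would be allocated
(hence the unlock, mirroring Chao's draft above):

    /*
     * Hypothetical short-circuit near the top of pt_irq_create_bind():
     * an MSI binding without a guest vector is going to be delivered over
     * an event channel, so none of the dpci/posted-interrupt state below
     * is needed, including the hvm_irq_dpci allocation itself.
     */
    if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI &&
         !pt_irq_bind->u.msi.gvec )
    {
        spin_unlock(&d->event_lock);
        return 0;
    }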
On Mon, May 06, 2019 at 03:39:40AM -0600, Jan Beulich wrote:
>>>> On 06.05.19 at 06:44, <chao.gao@intel.com> wrote:
>> On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
>>>Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
>>>the interrupt is going to be routed over an event channel?
>>
>> Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
>> to be compatible with old qemu.
>
>That would be ugly indeed.
>
>> I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
>> like the patch below:
>
>Is this meant as a replacement to your original patch, or as an
>add-on? In any event it's not immediately clear to me how

A replacement.

>...
>
>> --- a/xen/common/event_channel.c
>> +++ b/xen/common/event_channel.c
>> @@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
>>      if ( !info )
>>          ERROR_EXIT(-ENOMEM);
>>      info->evtchn = port;
>> -    rc = (!is_hvm_domain(d)
>> -          ? pirq_guest_bind(v, info,
>> -                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
>> -          : 0);
>> +    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));
>
>... this becoming unconditional won't conflict with its other
>invocation ...

Yes. It conflicts with the call in pt_irq_create_bind() for the non-MSI
case.

>
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -346,6 +346,12 @@ int pt_irq_create_bind(
>>          uint32_t gflags = pt_irq_bind->u.msi.gflags &
>>                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
>>
>> +        if ( !pt_irq_bind->u.msi.gvec )
>> +        {
>> +            spin_unlock(&d->event_lock);
>> +            return 0;
>> +        }
>
>... further down in this function, for the non-MSI case.
>Similarly I wonder whether the respective unbind function
>invocations then won't go (or already are?) out of sync.

The "out of sync" issue seems hard to solve. It is error-prone to move
pirq_guest_(un)bind from one hypercall to another.

On second thought, I plan to go back to my original patch. The only
issue for that patch is how to migrate the irq properly to avoid an IPI
during interrupt delivery. Actually, the current code already sets the
irq affinity correctly:

1. The pirq is bound to vcpu[0] in pt_irq_create_bind(). It also sets
   the corresponding physical irq's affinity to vcpu[0].
2. The evtchn is bound to vcpu[0] in evtchn_bind_pirq(). During
   delivery, we would send the pirq to vcpu[0] and no IPI is required.
3. If the evtchn is rebound to another vcpu in evtchn_bind_vcpu(), the
   affinity of the physical irq is already reconfigured there.

Thanks
Chao
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4290c7c..362d4bd 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -413,34 +413,52 @@ int pt_irq_create_bind(
                 pirq_dpci->gmsi.gflags = gflags;
             }
         }
-        /* Calculate dest_vcpu_id for MSI-type pirq migration. */
-        dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
-                         XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
-        dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
-        delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
-                                  XEN_DOMCTL_VMSI_X86_DELIV_MASK);
-
-        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
-        pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
-        spin_unlock(&d->event_lock);
-
-        pirq_dpci->gmsi.posted = false;
-        vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
-        if ( iommu_intpost )
+        /*
+         * Migrate pirq and create posted format IRTE only if we know the gmsi's
+         * dest_id and vector.
+         */
+        if ( pirq_dpci->gmsi.gvec )
         {
-            if ( delivery_mode == dest_LowestPrio )
-                vcpu = vector_hashing_dest(d, dest, dest_mode,
-                                           pirq_dpci->gmsi.gvec);
-            if ( vcpu )
-                pirq_dpci->gmsi.posted = true;
+            /* Calculate dest_vcpu_id for MSI-type pirq migration. */
+            dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
+                             XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
+            dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
+            delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
+                                      XEN_DOMCTL_VMSI_X86_DELIV_MASK);
+
+            dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
+            pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
+            spin_unlock(&d->event_lock);
+
+            pirq_dpci->gmsi.posted = false;
+            vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
+            if ( iommu_intpost )
+            {
+                if ( delivery_mode == dest_LowestPrio )
+                    vcpu = vector_hashing_dest(d, dest, dest_mode,
+                                               pirq_dpci->gmsi.gvec);
+                if ( vcpu )
+                    pirq_dpci->gmsi.posted = true;
+            }
+            if ( vcpu && iommu_enabled )
+                hvm_migrate_pirq(pirq_dpci, vcpu);
+
+            /* Use interrupt posting if it is supported. */
+            if ( iommu_intpost )
+                pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
+                               info, pirq_dpci->gmsi.gvec);
         }
-        if ( vcpu && iommu_enabled )
-            hvm_migrate_pirq(pirq_dpci, vcpu);
+        else /* pirq_dpci->gmsi.gvec == 0 */
+        {
+            pirq_dpci->gmsi.dest_vcpu_id = -1;
+            spin_unlock(&d->event_lock);
 
-        /* Use interrupt posting if it is supported. */
-        if ( iommu_intpost )
-            pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
-                           info, pirq_dpci->gmsi.gvec);
+            if ( unlikely(pirq_dpci->gmsi.posted) )
+            {
+                pi_update_irte(NULL, info, 0);
+                pirq_dpci->gmsi.posted = false;
+            }
+        }
 
         if ( pt_irq_bind->u.msi.gflags & XEN_DOMCTL_VMSI_X86_UNMASKED )
         {
When testing with an UP guest with a pass-thru device with vt-d pi
enabled in host, we observed that guest couldn't receive interrupts
from that pass-thru device. Dumping IRTE, we found the corresponding
IRTE is set to posted format with "vector" field as 0.

We would fall into this issue when guest used the pirq format of MSI
(see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
is repurposed, skip migration which is based on 'dest_id'.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/drivers/passthrough/io.c | 68 ++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 25 deletions(-)