Message ID | 1556601559-30921-1-git-send-email-chao.gao@intel.com (mailing list archive)
---|---
State | New, archived
Series | x86/pt: skip setup of posted format IRTE when gvec is 0
>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> When testing with an UP guest with a pass-thru device with vt-d pi
> enabled in host, we observed that guest couldn't receive interrupts
> from that pass-thru device. Dumping IRTE, we found the corresponding
> IRTE is set to posted format with "vector" field as 0.
>
> We would fall into this issue when guest used the pirq format of MSI
> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> is repurposed, skip migration which is based on 'dest_id'.

I've gone through all uses of gvec, and I couldn't find any existing
special casing of it being zero. I assume this is actually communication
between the kernel and qemu, in which case I'd like to see an
explanation of why the issue needs to be addressed in Xen rather than
qemu. Otherwise, if I've overlooked something, would you mind pointing
out where such special casing lives in Xen?

In any event it doesn't look correct to skip migration altogether in
that case. I'd rather expect it to require getting done differently.
After all there still is a (CPU, vector) tuple associated with that
{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
posted.

Finally it hasn't become clear to me why this is a UP-guest issue only.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -413,34 +413,52 @@ int pt_irq_create_bind(
>                  pirq_dpci->gmsi.gflags = gflags;
>              }
>          }
> -        /* Calculate dest_vcpu_id for MSI-type pirq migration. */
> -        dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
> -                         XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
> -        dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
> -        delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
> -                                  XEN_DOMCTL_VMSI_X86_DELIV_MASK);
> -
> -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> -        pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> -        spin_unlock(&d->event_lock);
> -
> -        pirq_dpci->gmsi.posted = false;
> -        vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
> -        if ( iommu_intpost )
> +        /*
> +         * Migrate pirq and create posted format IRTE only if we know the gmsi's
> +         * dest_id and vector.
> +         */
> +        if ( pirq_dpci->gmsi.gvec )

If we're to go with this hypervisor side change, please insert a
blank line above the comment.

Jan
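A note on the "pirq format of MSI" under discussion: the guest composes an
MSI message whose vector field is 0 and whose destination-ID bits carry a
pirq number rather than an APIC ID. The snippet below is an illustrative
sketch only; compose_pirq_msi() and its constants are invented for this
example (the real logic is xen_msi_compose_msg() in the Linux kernel, which
also copes with pirqs that do not fit in 8 bits):

    #include <stdint.h>

    struct msi_msg {
        uint32_t address_lo;
        uint32_t address_hi;
        uint32_t data;
    };

    /*
     * Illustrative sketch, not the actual Linux code: build a pirq-format
     * MSI message.  The dest_id bits (address bits 19:12) are repurposed
     * to carry the pirq, and the vector field in the data register stays
     * 0, which is exactly the gvec == 0 case discussed in this thread.
     */
    static void compose_pirq_msi(unsigned int pirq, struct msi_msg *msg)
    {
        msg->address_hi = 0;
        msg->address_lo = 0xfee00000u              /* standard MSI address base */
                          | ((pirq & 0xff) << 12); /* dest_id carries the pirq */
        msg->data = 0;                             /* fixed delivery, vector == 0 */
    }

Because the vector field is 0 and dest_id no longer names a vCPU, any Xen
logic keyed off those fields (IRTE posting, pirq migration) sees garbage.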
On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
>> When testing with an UP guest with a pass-thru device with vt-d pi
>> enabled in host, we observed that guest couldn't receive interrupts
>> from that pass-thru device. Dumping IRTE, we found the corresponding
>> IRTE is set to posted format with "vector" field as 0.
>>
>> We would fall into this issue when guest used the pirq format of MSI
>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> is repurposed, skip migration which is based on 'dest_id'.
>
>I've gone through all uses of gvec, and I couldn't find any existing
>special casing of it being zero. I assume this is actually communication
>between the kernel and qemu,

Yes.

>in which case I'd like to see an
>explanation of why the issue needs to be addressed in Xen rather
>than qemu.

To call pirq_guest_bind() to configure irq_desc properly.
Especially, we append a pointer of struct domain to 'action->guest' in
pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
in this interrupt and injects an interrupt to those domains.

>Otherwise, if I've overlooked something, would you
>mind pointing out where such special casing lives in Xen?
>
>In any event it doesn't look correct to skip migration altogether in
>that case. I'd rather expect it to require getting done differently.
>After all there still is a (CPU, vector) tuple associated with that
>{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>posted.

Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
is running on to reduce IPI. But the 'dest_id' field which used to
indicate the vmsi's target vcpu is missing, we don't know which cpu we should
migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
used in send_guest_pirq(). Do you think this choice is fine?

>
>Finally it hasn't become clear to me why this is a UP-guest issue
>only.

For SMP case, it happens to work. hvm_girq_dest_2_vcpu_id() would return
-1 for SMP in most cases. And then we won't use VT-d PI if there is no
dest vcpu. For UP case, hvm_girq_dest_2_vcpu_id() returns 0 without
matching.

>
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -413,34 +413,52 @@ int pt_irq_create_bind(
>>                  pirq_dpci->gmsi.gflags = gflags;
>>              }
>>          }
>> -        /* Calculate dest_vcpu_id for MSI-type pirq migration. */
>> -        dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
>> -                         XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
>> -        dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
>> -        delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
>> -                                  XEN_DOMCTL_VMSI_X86_DELIV_MASK);
>> -
>> -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>> -        pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>> -        spin_unlock(&d->event_lock);
>> -
>> -        pirq_dpci->gmsi.posted = false;
>> -        vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
>> -        if ( iommu_intpost )
>> +        /*
>> +         * Migrate pirq and create posted format IRTE only if we know the gmsi's
>> +         * dest_id and vector.
>> +         */
>> +        if ( pirq_dpci->gmsi.gvec )
>
>If we're to go with this hypervisor side change, please insert a
>blank line above the comment.

Will do

Thanks
Chao
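To make the pirq_guest_bind()/__do_IRQ_guest() relationship above concrete,
here is a self-contained conceptual model. It is a simplified sketch with
invented names, not the real Xen code (which also deals with sharing flags,
ACK types and locking):

    #include <stddef.h>

    #define IRQ_GUEST_MAX 7

    struct domain;                      /* opaque for this sketch */

    /* Per-IRQ record of the domains that asked to receive this interrupt. */
    struct irq_guest_action {
        unsigned int   nr_guests;
        struct domain *guest[IRQ_GUEST_MAX];
    };

    struct irq_desc {
        struct irq_guest_action *action;
    };

    /* Stub standing in for "inject a vIRQ / send an event to the domain". */
    static void send_pirq_to_domain(struct domain *d, int pirq)
    {
        (void)d;
        (void)pirq;
    }

    /* Roughly what pirq_guest_bind() contributes: remember the domain. */
    static int guest_bind(struct irq_desc *desc, struct domain *d)
    {
        struct irq_guest_action *act = desc->action;

        if ( act->nr_guests >= IRQ_GUEST_MAX )
            return -1;                  /* cannot share any further */
        act->guest[act->nr_guests++] = d;
        return 0;
    }

    /* Roughly what __do_IRQ_guest() does when the physical IRQ fires. */
    static void do_irq_guest(const struct irq_desc *desc, int pirq)
    {
        for ( unsigned int i = 0; i < desc->action->nr_guests; i++ )
            send_pirq_to_domain(desc->action->guest[i], pirq);
    }

Without the bind step, the dispatch loop never learns about the domain,
which is why the hypercall cannot simply be dropped on the QEMU side.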
On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
> When testing with an UP guest with a pass-thru device with vt-d pi
> enabled in host, we observed that guest couldn't receive interrupts
> from that pass-thru device. Dumping IRTE, we found the corresponding
> IRTE is set to posted format with "vector" field as 0.
>
> We would fall into this issue when guest used the pirq format of MSI

I think the above sentence is a bit confusing. I would rather say that
the guest is routing interrupts from passthrough devices over event
channels using pirqs.

> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> is repurposed, skip migration which is based on 'dest_id'.

Like Jan, I wonder why the device model (I assume QEMU) issues a
XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
these interrupts over an event channel; attempting to bind it to a
vector seems like a bug in the device model.

I would also consider whether it would make sense to not expose the
XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
are supported; performance-wise it's better to use posted interrupts
rather than routing them over event channels (which requires Xen
interaction).

Note that I think this whole XENFEAT_hvm_pirqs is a big hack which
abuses a hardware interface, and I would like to see it removed.

Thanks, Roger.
On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> >> When testing with an UP guest with a pass-thru device with vt-d pi
> >> enabled in host, we observed that guest couldn't receive interrupts
> >> from that pass-thru device. Dumping IRTE, we found the corresponding
> >> IRTE is set to posted format with "vector" field as 0.
> >>
> >> We would fall into this issue when guest used the pirq format of MSI
> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >> is repurposed, skip migration which is based on 'dest_id'.
> >
> >I've gone through all uses of gvec, and I couldn't find any existing
> >special casing of it being zero. I assume this is actually communication
> >between the kernel and qemu,
>
> Yes.
>
> >in which case I'd like to see an
> >explanation of why the issue needs to be addressed in Xen rather
> >than qemu.
>
> To call pirq_guest_bind() to configure irq_desc properly.
> Especially, we append a pointer of struct domain to 'action->guest' in
> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
> in this interrupt and injects an interrupt to those domains.
>
> >Otherwise, if I've overlooked something, would you
> >mind pointing out where such special casing lives in Xen?
> >
> >In any event it doesn't look correct to skip migration altogether in
> >that case. I'd rather expect it to require getting done differently.
> >After all there still is a (CPU, vector) tuple associated with that
> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
> >posted.
>
> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> is running on to reduce IPI. But the 'dest_id' field which used to
> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> used in send_guest_pirq(). Do you think this choice is fine?

I think that by the time the device model calls into pirq_guest_bind
the PIRQ won't be bound to any event channel, so pirq->evtchn would be
0.

Note that the binding of the PIRQ with the event channel is done
afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.

It seems like the device model should be using a different set of
hypercalls to setup a PIRQ that is routed over an event channel, ie:
PHYSDEVOP_map_pirq and friends.

Thanks, Roger.
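For context, the guest-side step Roger refers to (binding the already-mapped
pirq to an event channel) looks roughly like the following. This is a hedged
sketch modelled on Xen's public event-channel interface; the struct layout is
abridged and the hypercall stub is assumed to be provided by the guest
kernel:

    #include <stdint.h>

    /* Abridged from Xen's public event_channel.h for illustration. */
    #define EVTCHNOP_bind_pirq     2
    #define BIND_PIRQ__WILL_SHARE  1

    typedef uint32_t evtchn_port_t;

    struct evtchn_bind_pirq {
        uint32_t pirq;         /* IN: the pirq the device model mapped earlier */
        uint32_t flags;        /* IN: BIND_PIRQ__WILL_SHARE */
        evtchn_port_t port;    /* OUT: the resulting event channel port */
    };

    /* Assumed to exist in the guest kernel's hypercall layer. */
    extern long HYPERVISOR_event_channel_op(int cmd, void *arg);

    /* Bind an already-mapped pirq to an event channel; returns port or < 0. */
    static long bind_pirq_to_evtchn(uint32_t pirq)
    {
        struct evtchn_bind_pirq bind = {
            .pirq  = pirq,
            .flags = BIND_PIRQ__WILL_SHARE,
        };
        long rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind);

        return rc ? rc : (long)bind.port;
    }

Since this call only happens later, from xen_hvm_setup_msi_irqs, pirq->evtchn
is indeed still 0 at the time XEN_DOMCTL_bind_pt_irq runs.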
>>> On 30.04.19 at 11:01, <chao.gao@intel.com> wrote:
> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>>In any event it doesn't look correct to skip migration altogether in
>>that case. I'd rather expect it to require getting done differently.
>>After all there still is a (CPU, vector) tuple associated with that
>>{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>>posted.
>
> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> is running on to reduce IPI. But the 'dest_id' field which used to
> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> used in send_guest_pirq(). Do you think this choice is fine?

Well, the code you change is the one binding the IRQ, not the one
delivering it. Is the event channel set up already at all at that
point? See also Roger's reply.

>>Finally it hasn't become clear to me why this is a UP-guest issue
>>only.
>
> For SMP case, it happens to work. hvm_girq_dest_2_vcpu_id() would return
> -1 for SMP in most cases. And then we won't use VT-d PI if there is no
> dest vcpu. For UP case, hvm_girq_dest_2_vcpu_id() returns 0 without
> matching.

"Happens to work" together with how hvm_girq_dest_2_vcpu_id() works
looks to mean "happens to sometimes work" (maybe "often", but hardly
"always"): The function easily can return other than -1 for multi-CPU
guests as well. Which then is another reason to consider fixing the
issue in qemu instead.

Jan
On Tue, Apr 30, 2019 at 11:08:54AM +0200, Roger Pau Monné wrote:
>On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
>> When testing with an UP guest with a pass-thru device with vt-d pi
>> enabled in host, we observed that guest couldn't receive interrupts
>> from that pass-thru device. Dumping IRTE, we found the corresponding
>> IRTE is set to posted format with "vector" field as 0.
>>
>> We would fall into this issue when guest used the pirq format of MSI
>
>I think the above sentence is a bit confusing. I would rather say that
>the guest is routing interrupts from passthrough devices over event
>channels using pirqs.

Yes. It is better than it was.

>
>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> is repurposed, skip migration which is based on 'dest_id'.
>
>Like Jan, I wonder why the device model (I assume QEMU) issues a
>XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
>these interrupts over an event channel; attempting to bind it to a
>vector seems like a bug in the device model.
>
>I would also consider whether it would make sense to not expose the
>XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
>are supported; performance-wise it's better to use posted interrupts
>rather than routing them over event channels (which requires Xen
>interaction).

It is reasonable. But I think currently a guest can add "xen_nopv" to
its kernel boot options to avoid using evtchns. There might be some
cases even with pass-thru devices (such as when the guest uses a
polling mode driver for pass-thru devices) where we care more about the
efficiency of delivering virtual interrupts. So a separate option for
evtchns in the guest kernel seems like a better choice.

Thanks
Chao

>
>Note that I think this whole XENFEAT_hvm_pirqs is a big hack which
>abuses a hardware interface, and I would like to see it removed.
>
>Thanks, Roger.
On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
>On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
>> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
>> >> When testing with an UP guest with a pass-thru device with vt-d pi
>> >> enabled in host, we observed that guest couldn't receive interrupts
>> >> from that pass-thru device. Dumping IRTE, we found the corresponding
>> >> IRTE is set to posted format with "vector" field as 0.
>> >>
>> >> We would fall into this issue when guest used the pirq format of MSI
>> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> >> is repurposed, skip migration which is based on 'dest_id'.
>> >
>> >I've gone through all uses of gvec, and I couldn't find any existing
>> >special casing of it being zero. I assume this is actually communication
>> >between the kernel and qemu,
>>
>> Yes.
>>
>> >in which case I'd like to see an
>> >explanation of why the issue needs to be addressed in Xen rather
>> >than qemu.
>>
>> To call pirq_guest_bind() to configure irq_desc properly.
>> Especially, we append a pointer of struct domain to 'action->guest' in
>> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
>> in this interrupt and injects an interrupt to those domains.
>>
>> >Otherwise, if I've overlooked something, would you
>> >mind pointing out where such special casing lives in Xen?
>> >
>> >In any event it doesn't look correct to skip migration altogether in
>> >that case. I'd rather expect it to require getting done differently.
>> >After all there still is a (CPU, vector) tuple associated with that
>> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>> >posted.
>>
>> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
>> is running on to reduce IPI. But the 'dest_id' field which used to
>> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
>> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
>> used in send_guest_pirq(). Do you think this choice is fine?
>
>I think that by the time the device model calls into pirq_guest_bind
>the PIRQ won't be bound to any event channel, so pirq->evtchn would be
>0.

Then skipping pirq migration is the only choice here? And we can migrate
the pirq when it is bound to an event channel.

>
>Note that the binding of the PIRQ with the event channel is done
>afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
>
>It seems like the device model should be using a different set of
>hypercalls to setup a PIRQ that is routed over an event channel, ie:
>PHYSDEVOP_map_pirq and friends.

Now qemu is using PHYSDEVOP_map_pirq. Right?

Thanks
Chao
On 30/04/2019 17:24, Chao Gao wrote:
> On Tue, Apr 30, 2019 at 11:08:54AM +0200, Roger Pau Monné wrote:
>> On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
>>> When testing with an UP guest with a pass-thru device with vt-d pi
>>> enabled in host, we observed that guest couldn't receive interrupts
>>> from that pass-thru device. Dumping IRTE, we found the corresponding
>>> IRTE is set to posted format with "vector" field as 0.
>>>
>>> We would fall into this issue when guest used the pirq format of MSI
>> I think the above sentence is a bit confusing. I would rather say that
>> the guest is routing interrupts from passthrough devices over event
>> channels using pirqs.
> Yes. It is better than it was.
>
>>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>>> is repurposed, skip migration which is based on 'dest_id'.
>> Like Jan, I wonder why the device model (I assume QEMU) issues a
>> XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
>> these interrupts over an event channel; attempting to bind it to a
>> vector seems like a bug in the device model.
>>
>> I would also consider whether it would make sense to not expose the
>> XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
>> are supported; performance-wise it's better to use posted interrupts
>> rather than routing them over event channels (which requires Xen
>> interaction).
> It is reasonable. But I think currently a guest can add "xen_nopv" to
> its kernel boot options to avoid using evtchns. There might be some
> cases even with pass-thru devices (such as when the guest uses a
> polling mode driver for pass-thru devices) where we care more about the
> efficiency of delivering virtual interrupts. So a separate option for
> evtchns in the guest kernel seems like a better choice.

This is a festering swamp, and is going to need some careful
architectural work to untangle.

At the moment, guests which are xen-aware will try to use event channels
for everything. I'm pretty sure we've got some ABI incompatibilities
here with hardware APIC acceleration, and I still have yet(!) to
diagnose the underlying problem which is preventing XenServer from
enabling hardware APIC acceleration. (VMs still intermittently wedge on
migrate with what appears to be a lost interrupt.)

The legacy HVM callback via single vector is incompatible with hardware
APIC acceleration, because it exists specifically to skip going through
the IRR. I can't think of a proper solution to this, other than crashing
the guest when it tries to set such a situation up.

From an "architecturally ideal" point of view, we obviously want to be
using APICV and/or Posted interrupt support whenever possible, and be
using these methods in preference to event channels.

Injecting interrupts (including signalling an event upcall) using the
PIR is more efficient than other means, so long as we can ensure that
the guest will handle the vector like any other edge triggered
interrupt, and ack it at the LAPIC. (This is where the legacy 'callback
via' method breaks down, as it was intentionally designed to be used
without an ack at the LAPIC.)

As soon as we get to device interrupts, posted is definitely the way to
go, especially as the guest should not be aware of the routing behind
the scenes in the first place.

To be clear, Chao - I'm not asking you to fix this. This is a decade of
technical debt which has been steadily growing, and it needs to start
with a coherent plan for what the current state of play is, and how
we'd like to proceed.

~Andrew
On Wed, May 01, 2019 at 12:41:13AM +0800, Chao Gao wrote:
> On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
> >On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
> >> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
> >> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> >> >> When testing with an UP guest with a pass-thru device with vt-d pi
> >> >> enabled in host, we observed that guest couldn't receive interrupts
> >> >> from that pass-thru device. Dumping IRTE, we found the corresponding
> >> >> IRTE is set to posted format with "vector" field as 0.
> >> >>
> >> >> We would fall into this issue when guest used the pirq format of MSI
> >> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >> >> is repurposed, skip migration which is based on 'dest_id'.
> >> >
> >> >I've gone through all uses of gvec, and I couldn't find any existing
> >> >special casing of it being zero. I assume this is actually communication
> >> >between the kernel and qemu,
> >>
> >> Yes.
> >>
> >> >in which case I'd like to see an
> >> >explanation of why the issue needs to be addressed in Xen rather
> >> >than qemu.
> >>
> >> To call pirq_guest_bind() to configure irq_desc properly.
> >> Especially, we append a pointer of struct domain to 'action->guest' in
> >> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
> >> in this interrupt and injects an interrupt to those domains.
> >>
> >> >Otherwise, if I've overlooked something, would you
> >> >mind pointing out where such special casing lives in Xen?
> >> >
> >> >In any event it doesn't look correct to skip migration altogether in
> >> >that case. I'd rather expect it to require getting done differently.
> >> >After all there still is a (CPU, vector) tuple associated with that
> >> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
> >> >posted.
> >>
> >> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> >> is running on to reduce IPI. But the 'dest_id' field which used to
> >> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> >> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> >> used in send_guest_pirq(). Do you think this choice is fine?
> >
> >I think that by the time the device model calls into pirq_guest_bind
> >the PIRQ won't be bound to any event channel, so pirq->evtchn would be
> >0.
>
> Then skipping pirq migration is the only choice here? And we can migrate
> the pirq when it is bound to an event channel.
>
> >
> >Note that the binding of the PIRQ with the event channel is done
> >afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
> >
> >It seems like the device model should be using a different set of
> >hypercalls to setup a PIRQ that is routed over an event channel, ie:
> >PHYSDEVOP_map_pirq and friends.
>
> Now qemu is using PHYSDEVOP_map_pirq. Right?

Oh yes, QEMU already uses PHYSDEVOP_map_pirq to setup the interrupt.
Then I'm not sure I see why QEMU calls XEN_DOMCTL_bind_pt_irq for
interrupts that are routed over event channels. That hypercall is used
to bind a pirq to a native guest interrupt injection mechanism, which
shouldn't be used if the interrupt is going to be delivered over an
event channel.

Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
the interrupt is going to be routed over an event channel? That would
avoid having to add more quirks to the hypercall implementation.

Thanks, Roger.
On Tue, Apr 30, 2019 at 05:48:14PM +0100, Andrew Cooper wrote:
> On 30/04/2019 17:24, Chao Gao wrote:
> > On Tue, Apr 30, 2019 at 11:08:54AM +0200, Roger Pau Monné wrote:
> >> On Tue, Apr 30, 2019 at 01:19:19PM +0800, Chao Gao wrote:
> >>> When testing with an UP guest with a pass-thru device with vt-d pi
> >>> enabled in host, we observed that guest couldn't receive interrupts
> >>> from that pass-thru device. Dumping IRTE, we found the corresponding
> >>> IRTE is set to posted format with "vector" field as 0.
> >>>
> >>> We would fall into this issue when guest used the pirq format of MSI
> >> I think the above sentence is a bit confusing. I would rather say that
> >> the guest is routing interrupts from passthrough devices over event
> >> channels using pirqs.
> > Yes. It is better than it was.
> >
> >>> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >>> is repurposed, skip migration which is based on 'dest_id'.
> >> Like Jan, I wonder why the device model (I assume QEMU) issues a
> >> XEN_DOMCTL_bind_pt_irq hypercall when the guest is attempting to route
> >> these interrupts over an event channel; attempting to bind it to a
> >> vector seems like a bug in the device model.
> >>
> >> I would also consider whether it would make sense to not expose the
> >> XENFEAT_hvm_pirqs feature when doing passthrough if posted interrupts
> >> are supported; performance-wise it's better to use posted interrupts
> >> rather than routing them over event channels (which requires Xen
> >> interaction).
> > It is reasonable. But I think currently a guest can add "xen_nopv" to
> > its kernel boot options to avoid using evtchns. There might be some
> > cases even with pass-thru devices (such as when the guest uses a
> > polling mode driver for pass-thru devices) where we care more about the
> > efficiency of delivering virtual interrupts. So a separate option for
> > evtchns in the guest kernel seems like a better choice.
>
> This is a festering swamp, and is going to need some careful
> architectural work to untangle.
>
> At the moment, guests which are xen-aware will try to use event channels
> for everything. I'm pretty sure we've got some ABI incompatibilities
> here with hardware APIC acceleration, and I still have yet(!) to
> diagnose the underlying problem which is preventing XenServer from
> enabling hardware APIC acceleration. (VMs still intermittently wedge on
> migrate with what appears to be a lost interrupt.)

My opinion is that XENFEAT_hvm_pirqs should be removed:

 - It abuses a hardware interface to pass Xen specific information.
   When routing pirqs over event channels for HVM the pirq is passed on
   the MSI address field, hijacking the native dest_id field.
 - Cannot be used with posted interrupts or APICV.

Iff routing physical interrupts over event channels is still a useful
thing for HVM, it should be done using hypercalls, like it's done on a
PV dom0.

> The legacy HVM callback via single vector is incompatible with hardware
> APIC acceleration, because it exists specifically to skip going through
> the IRR. I can't think of a proper solution to this, other than crashing
> the guest when it tries to set such a situation up.
>
> From an "architecturally ideal" point of view, we obviously want to be
> using APICV and/or Posted interrupt support whenever possible, and be
> using these methods in preference to event channels.
>
> Injecting interrupts (including signalling an event upcall) using the
> PIR is more efficient than other means, so long as we can ensure that
> the guest will handle the vector like any other edge triggered
> interrupt, and ack it at the LAPIC. (This is where the legacy 'callback
> via' method breaks down, as it was intentionally designed to be used
> without an ack at the LAPIC.)

IIRC there's already a correct way to setup an event channel upcall
against a vector using HVMOP_set_evtchn_upcall_vector which requires an
APIC ack. Now the issue is switching existing users to this new
callback setup mechanism.

> As soon as we get to device interrupts, posted is definitely the way to
> go, especially as the guest should not be aware of the routing behind
> the scenes in the first place.
>
> To be clear, Chao - I'm not asking you to fix this. This is a decade of
> technical debt which has been steadily growing, and it needs to start
> with a coherent plan for what the current state of play is, and how
> we'd like to proceed.

I agree. I've suggested a possible fix for the problem at hand, but as
stated above I think XENFEAT_hvm_pirqs should be removed.

Thanks, Roger.
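For reference, the HVMOP_set_evtchn_upcall_vector mechanism Roger mentions is
set up along these lines. This is a minimal sketch based on my recollection
of Xen's public hvm_op.h; the opcode value and field names should be checked
against the actual headers, and the hypercall stub is assumed to be provided
by the guest kernel:

    #include <stdint.h>

    /* From Xen's public hvm/hvm_op.h, abridged; verify against the headers. */
    #define HVMOP_set_evtchn_upcall_vector 23

    struct xen_hvm_evtchn_upcall_vector {
        uint32_t vcpu;      /* vCPU to configure */
        uint8_t  vector;    /* IDT vector events are delivered on */
    };

    /* Assumed to exist in the guest kernel's hypercall layer. */
    extern long HYPERVISOR_hvm_op(int cmd, void *arg);

    /*
     * Register a per-vCPU event channel upcall vector.  Events delivered
     * this way look like ordinary edge triggered interrupts and require
     * an APIC EOI, which is what keeps them compatible with hardware APIC
     * acceleration, unlike the legacy "callback via" mechanism.
     */
    static long set_upcall_vector(uint32_t vcpu, uint8_t vector)
    {
        struct xen_hvm_evtchn_upcall_vector arg = {
            .vcpu   = vcpu,
            .vector = vector,
        };

        return HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &arg);
    }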
On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
>On Wed, May 01, 2019 at 12:41:13AM +0800, Chao Gao wrote:
>> On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
>> >On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
>> >> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
>> >> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
>> >> >> When testing with an UP guest with a pass-thru device with vt-d pi
>> >> >> enabled in host, we observed that guest couldn't receive interrupts
>> >> >> from that pass-thru device. Dumping IRTE, we found the corresponding
>> >> >> IRTE is set to posted format with "vector" field as 0.
>> >> >>
>> >> >> We would fall into this issue when guest used the pirq format of MSI
>> >> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
>> >> >> is repurposed, skip migration which is based on 'dest_id'.
>> >> >
>> >> >I've gone through all uses of gvec, and I couldn't find any existing
>> >> >special casing of it being zero. I assume this is actually communication
>> >> >between the kernel and qemu,
>> >>
>> >> Yes.
>> >>
>> >> >in which case I'd like to see an
>> >> >explanation of why the issue needs to be addressed in Xen rather
>> >> >than qemu.
>> >>
>> >> To call pirq_guest_bind() to configure irq_desc properly.
>> >> Especially, we append a pointer of struct domain to 'action->guest' in
>> >> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
>> >> in this interrupt and injects an interrupt to those domains.
>> >>
>> >> >Otherwise, if I've overlooked something, would you
>> >> >mind pointing out where such special casing lives in Xen?
>> >> >
>> >> >In any event it doesn't look correct to skip migration altogether in
>> >> >that case. I'd rather expect it to require getting done differently.
>> >> >After all there still is a (CPU, vector) tuple associated with that
>> >> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
>> >> >posted.
>> >>
>> >> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
>> >> is running on to reduce IPI. But the 'dest_id' field which used to
>> >> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
>> >> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
>> >> used in send_guest_pirq(). Do you think this choice is fine?
>> >
>> >I think that by the time the device model calls into pirq_guest_bind
>> >the PIRQ won't be bound to any event channel, so pirq->evtchn would be
>> >0.
>>
>> Then skipping pirq migration is the only choice here? And we can migrate
>> the pirq when it is bound to an event channel.
>>
>> >
>> >Note that the binding of the PIRQ with the event channel is done
>> >afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
>> >
>> >It seems like the device model should be using a different set of
>> >hypercalls to setup a PIRQ that is routed over an event channel, ie:
>> >PHYSDEVOP_map_pirq and friends.
>>
>> Now qemu is using PHYSDEVOP_map_pirq. Right?
>
>Oh yes, QEMU already uses PHYSDEVOP_map_pirq to setup the interrupt.
>Then I'm not sure I see why QEMU calls XEN_DOMCTL_bind_pt_irq for
>interrupts that are routed over event channels. That hypercall is used

As I said above, it is to call pirq_guest_bind() to hook up to the irq
handler.

XEN_DOMCTL_bind_pt_pirq does two things:
#1. bind pirq with a guest interrupt
#2. register (domain,pirq) to the interrupt handler

Currently, for a pirq routed to an evtchn, #1 is done by another
hypercall, evtchn_bind_pirq, and #2 is done in XEN_DOMCTL_bind_pt_irq.

>to bind a pirq to a native guest interrupt injection mechanism, which
>shouldn't be used if the interrupt is going to be delivered over an
>event channel.
>
>Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
>the interrupt is going to be routed over an event channel?

Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
to be compatible with old qemu.

I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
like the patch below:

diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index e86e2bf..0bcddb9 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
     if ( !info )
         ERROR_EXIT(-ENOMEM);
     info->evtchn = port;
-    rc = (!is_hvm_domain(d)
-          ? pirq_guest_bind(v, info,
-                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
-          : 0);
+    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));
     if ( rc != 0 )
     {
         info->evtchn = 0;
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4290c7c..5a0b83e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -346,6 +346,12 @@ int pt_irq_create_bind(
         uint32_t gflags = pt_irq_bind->u.msi.gflags &
                           ~XEN_DOMCTL_VMSI_X86_UNMASKED;
 
+        if ( !pt_irq_bind->u.msi.gvec )
+        {
+            spin_unlock(&d->event_lock);
+            return 0;
+        }
+
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         {
             pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED | HVM_IRQ_DPCI_MACH_MSI |

Thanks
Chao
>>> On 06.05.19 at 06:44, <chao.gao@intel.com> wrote:
> On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
>>Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
>>the interrupt is going to be routed over an event channel?
>
> Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
> to be compatible with old qemu.

That would be ugly indeed.

> I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
> like the patch below:

Is this meant as a replacement to your original patch, or as an
add-on? In any event it's not immediately clear to me how
...

> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
>      if ( !info )
>          ERROR_EXIT(-ENOMEM);
>      info->evtchn = port;
> -    rc = (!is_hvm_domain(d)
> -          ? pirq_guest_bind(v, info,
> -                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
> -          : 0);
> +    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));

... this becoming unconditional won't conflict with its other
invocation ...

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -346,6 +346,12 @@ int pt_irq_create_bind(
>          uint32_t gflags = pt_irq_bind->u.msi.gflags &
>                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
>
> +        if ( !pt_irq_bind->u.msi.gvec )
> +        {
> +            spin_unlock(&d->event_lock);
> +            return 0;
> +        }

... further down in this function, for the non-MSI case.
Similarly I wonder whether the respective unbind function
invocations then won't go (or already are?) out of sync.

Jan
On Mon, May 06, 2019 at 12:44:41PM +0800, Chao Gao wrote:
> On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
> >On Wed, May 01, 2019 at 12:41:13AM +0800, Chao Gao wrote:
> >> On Tue, Apr 30, 2019 at 11:30:33AM +0200, Roger Pau Monné wrote:
> >> >On Tue, Apr 30, 2019 at 05:01:21PM +0800, Chao Gao wrote:
> >> >> On Tue, Apr 30, 2019 at 01:56:31AM -0600, Jan Beulich wrote:
> >> >> >>>> On 30.04.19 at 07:19, <chao.gao@intel.com> wrote:
> >> >> >> When testing with an UP guest with a pass-thru device with vt-d pi
> >> >> >> enabled in host, we observed that guest couldn't receive interrupts
> >> >> >> from that pass-thru device. Dumping IRTE, we found the corresponding
> >> >> >> IRTE is set to posted format with "vector" field as 0.
> >> >> >>
> >> >> >> We would fall into this issue when guest used the pirq format of MSI
> >> >> >> (see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
> >> >> >> is repurposed, skip migration which is based on 'dest_id'.
> >> >> >
> >> >> >I've gone through all uses of gvec, and I couldn't find any existing
> >> >> >special casing of it being zero. I assume this is actually communication
> >> >> >between the kernel and qemu,
> >> >>
> >> >> Yes.
> >> >>
> >> >> >in which case I'd like to see an
> >> >> >explanation of why the issue needs to be addressed in Xen rather
> >> >> >than qemu.
> >> >>
> >> >> To call pirq_guest_bind() to configure irq_desc properly.
> >> >> Especially, we append a pointer of struct domain to 'action->guest' in
> >> >> pirq_guest_bind(). Then __do_IRQ_guest() knows domains that are interested
> >> >> in this interrupt and injects an interrupt to those domains.
> >> >>
> >> >> >Otherwise, if I've overlooked something, would you
> >> >> >mind pointing out where such special casing lives in Xen?
> >> >> >
> >> >> >In any event it doesn't look correct to skip migration altogether in
> >> >> >that case. I'd rather expect it to require getting done differently.
> >> >> >After all there still is a (CPU, vector) tuple associated with that
> >> >> >{,p}IRQ if it's not posted, and hvm_migrate_pirq() is a no-op if it is
> >> >> >posted.
> >> >>
> >> >> Here, we try to set irq's target cpu to the cpu which the vmsi's target vcpu
> >> >> is running on to reduce IPI. But the 'dest_id' field which used to
> >> >> indicate the vmsi's target vcpu is missing, we don't know which cpu we should
> >> >> migrate the irq to. One possible choice is the 'chn->notify_vcpu_id'
> >> >> used in send_guest_pirq(). Do you think this choice is fine?
> >> >
> >> >I think that by the time the device model calls into pirq_guest_bind
> >> >the PIRQ won't be bound to any event channel, so pirq->evtchn would be
> >> >0.
> >>
> >> Then skipping pirq migration is the only choice here? And we can migrate
> >> the pirq when it is bound to an event channel.
> >>
> >> >
> >> >Note that the binding of the PIRQ with the event channel is done
> >> >afterwards in xen_hvm_setup_msi_irqs by the Linux kernel.
> >> >
> >> >It seems like the device model should be using a different set of
> >> >hypercalls to setup a PIRQ that is routed over an event channel, ie:
> >> >PHYSDEVOP_map_pirq and friends.
> >>
> >> Now qemu is using PHYSDEVOP_map_pirq. Right?
> >
> >Oh yes, QEMU already uses PHYSDEVOP_map_pirq to setup the interrupt.
> >Then I'm not sure I see why QEMU calls XEN_DOMCTL_bind_pt_irq for
> >interrupts that are routed over event channels. That hypercall is used
>
> As I said above, it is to call pirq_guest_bind() to hook up to the irq
> handler.
>
> XEN_DOMCTL_bind_pt_pirq does two things:
> #1. bind pirq with a guest interrupt
> #2. register (domain,pirq) to the interrupt handler
>
> Currently, for a pirq routed to an evtchn, #1 is done by another
> hypercall, evtchn_bind_pirq, and #2 is done in XEN_DOMCTL_bind_pt_irq.

So XEN_DOMCTL_bind_pt_irq basically does the pirq_guest_bind in this
case, and that's why the call to pirq_guest_bind is avoided in
evtchn_bind_pirq for HVM guests.

>
> >to bind a pirq to a native guest interrupt injection mechanism, which
> >shouldn't be used if the interrupt is going to be delivered over an
> >event channel.
> >
> >Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
> >the interrupt is going to be routed over an event channel?
>
> Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
> to be compatible with old qemu.

OK, leaving the XEN_DOMCTL_bind_pt_irq call in QEMU and making it a
no-op in this case seems like a good compromise solution IMO. It might
be helpful to add a comment to the QEMU code noting that the
XEN_DOMCTL_bind_pt_irq hypercall is not needed when routing pirqs over
event channels for HVM guests with Xen >= 4.13.

> I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
> like the patch below:
>
> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> index e86e2bf..0bcddb9 100644
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
>      if ( !info )
>          ERROR_EXIT(-ENOMEM);
>      info->evtchn = port;
> -    rc = (!is_hvm_domain(d)
> -          ? pirq_guest_bind(v, info,
> -                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
> -          : 0);
> +    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));
>      if ( rc != 0 )
>      {
>          info->evtchn = 0;
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 4290c7c..5a0b83e 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -346,6 +346,12 @@ int pt_irq_create_bind(
>          uint32_t gflags = pt_irq_bind->u.msi.gflags &
>                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
>
> +        if ( !pt_irq_bind->u.msi.gvec )
> +        {
> +            spin_unlock(&d->event_lock);
> +            return 0;
> +        }
> +

Can you see about short-circuiting pt_irq_create_bind much earlier?
Likely at the start of the function; AFAICT pirqs routed over event
channels don't need hvm_irq_dpci, so I think you can avoid allocating
it if all pirqs from a domain are routed over event channels.

Thanks, Roger.
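A rough illustration of the early exit Roger is asking for, given as an
untested sketch rather than a proposed patch. It assumes gvec == 0 remains
the marker for pirqs routed over event channels, and that the check sits
after the event_lock is taken but before hvm_irq_dpci would be allocated
(hence the unlock, mirroring Chao's draft above):

    /*
     * Hypothetical short-circuit near the top of pt_irq_create_bind():
     * an MSI binding without a guest vector is going to be delivered over
     * an event channel, so none of the dpci/posted-interrupt state below
     * is needed, including the hvm_irq_dpci allocation itself.
     */
    if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI &&
         !pt_irq_bind->u.msi.gvec )
    {
        spin_unlock(&d->event_lock);
        return 0;
    }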
On Mon, May 06, 2019 at 03:39:40AM -0600, Jan Beulich wrote:
>>>> On 06.05.19 at 06:44, <chao.gao@intel.com> wrote:
>> On Thu, May 02, 2019 at 10:20:09AM +0200, Roger Pau Monné wrote:
>>>Can you see about avoiding the XEN_DOMCTL_bind_pt_irq call in QEMU if
>>>the interrupt is going to be routed over an event channel?
>>
>> Yes. It is doable. But it needs changes in both qemu and Xen and some tricks
>> to be compatible with old qemu.
>
>That would be ugly indeed.
>
>> I prefer not to touch qemu and keep qemu unaware of MSI's "routing over evtchn",
>> like the patch below:
>
>Is this meant as a replacement to your original patch, or as an
>add-on? In any event it's not immediately clear to me how

A replacement.

>...
>
>> --- a/xen/common/event_channel.c
>> +++ b/xen/common/event_channel.c
>> @@ -504,10 +504,7 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
>>      if ( !info )
>>          ERROR_EXIT(-ENOMEM);
>>      info->evtchn = port;
>> -    rc = (!is_hvm_domain(d)
>> -          ? pirq_guest_bind(v, info,
>> -                            !!(bind->flags & BIND_PIRQ__WILL_SHARE))
>> -          : 0);
>> +    rc = pirq_guest_bind(v, info, !!(bind->flags & BIND_PIRQ__WILL_SHARE));
>
>... this becoming unconditional won't conflict with its other
>invocation ...

Yes. It conflicts with the call in pt_irq_create_bind() for the non-MSI
case.

>
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -346,6 +346,12 @@ int pt_irq_create_bind(
>>          uint32_t gflags = pt_irq_bind->u.msi.gflags &
>>                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
>>
>> +        if ( !pt_irq_bind->u.msi.gvec )
>> +        {
>> +            spin_unlock(&d->event_lock);
>> +            return 0;
>> +        }
>
>... further down in this function, for the non-MSI case.
>Similarly I wonder whether the respective unbind function
>invocations then won't go (or already are?) out of sync.

The "out of sync" issue seems hard to solve. It is error-prone to move
pirq_guest_(un)bind from one hypercall to another.

On second thought, I plan to go back to my original patch. The only
issue for that patch is how to migrate the irq properly to avoid an IPI
during interrupt delivery. Actually, the current code already sets the
irq affinity correctly:

1. The pirq is bound to vcpu[0] in pt_irq_create_bind(). It also sets
   the corresponding physical irq's affinity to vcpu[0].
2. The evtchn is bound to vcpu[0] in evtchn_bind_pirq(). During
   delivery, we would send the pirq to vcpu[0] and no IPI is required.
3. If the evtchn is rebound to another vcpu in evtchn_bind_vcpu(), the
   affinity of the physical irq is already reconfigured there.

Thanks
Chao
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4290c7c..362d4bd 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -413,34 +413,52 @@ int pt_irq_create_bind(
                 pirq_dpci->gmsi.gflags = gflags;
             }
         }
-        /* Calculate dest_vcpu_id for MSI-type pirq migration. */
-        dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
-                         XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
-        dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
-        delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
-                                  XEN_DOMCTL_VMSI_X86_DELIV_MASK);
-
-        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
-        pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
-        spin_unlock(&d->event_lock);
-
-        pirq_dpci->gmsi.posted = false;
-        vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
-        if ( iommu_intpost )
+        /*
+         * Migrate pirq and create posted format IRTE only if we know the gmsi's
+         * dest_id and vector.
+         */
+        if ( pirq_dpci->gmsi.gvec )
         {
-            if ( delivery_mode == dest_LowestPrio )
-                vcpu = vector_hashing_dest(d, dest, dest_mode,
-                                           pirq_dpci->gmsi.gvec);
-            if ( vcpu )
-                pirq_dpci->gmsi.posted = true;
+            /* Calculate dest_vcpu_id for MSI-type pirq migration. */
+            dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
+                             XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
+            dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
+            delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
+                                      XEN_DOMCTL_VMSI_X86_DELIV_MASK);
+
+            dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
+            pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
+            spin_unlock(&d->event_lock);
+
+            pirq_dpci->gmsi.posted = false;
+            vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
+            if ( iommu_intpost )
+            {
+                if ( delivery_mode == dest_LowestPrio )
+                    vcpu = vector_hashing_dest(d, dest, dest_mode,
+                                               pirq_dpci->gmsi.gvec);
+                if ( vcpu )
+                    pirq_dpci->gmsi.posted = true;
+            }
+            if ( vcpu && iommu_enabled )
+                hvm_migrate_pirq(pirq_dpci, vcpu);
+
+            /* Use interrupt posting if it is supported. */
+            if ( iommu_intpost )
+                pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
+                               info, pirq_dpci->gmsi.gvec);
         }
-        if ( vcpu && iommu_enabled )
-            hvm_migrate_pirq(pirq_dpci, vcpu);
+        else /* pirq_dpci->gmsi.gvec == 0 */
+        {
+            pirq_dpci->gmsi.dest_vcpu_id = -1;
+            spin_unlock(&d->event_lock);
 
-        /* Use interrupt posting if it is supported. */
-        if ( iommu_intpost )
-            pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
-                           info, pirq_dpci->gmsi.gvec);
+            if ( unlikely(pirq_dpci->gmsi.posted) )
+            {
+                pi_update_irte(NULL, info, 0);
+                pirq_dpci->gmsi.posted = false;
+            }
+        }
 
         if ( pt_irq_bind->u.msi.gflags & XEN_DOMCTL_VMSI_X86_UNMASKED )
         {
When testing with an UP guest with a pass-thru device with vt-d pi
enabled in host, we observed that guest couldn't receive interrupts
from that pass-thru device. Dumping IRTE, we found the corresponding
IRTE is set to posted format with "vector" field as 0.

We would fall into this issue when guest used the pirq format of MSI
(see the comment xen_msi_compose_msg() in linux kernel). As 'dest_id'
is repurposed, skip migration which is based on 'dest_id'.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 xen/drivers/passthrough/io.c | 68 ++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 25 deletions(-)