diff mbox

[v4] kvm: Use a bitmap for tracking used GSIs

Message ID 20090513043835.6696.27384.stgit@dl380g6-3.ned.telco.ned.telco (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Williamson May 13, 2009, 4:41 a.m. UTC
We're currently using a counter to track the most recent GSI we've
handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
assignment with a driver that regularly toggles the MSI enable bit.
This can mean only a few minutes of usable run time.  Instead, track
used GSIs in a bitmap.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
---

 v2: Added mutex to protect gsi bitmap
 v3: Updated for comments from Michael Tsirkin
     No longer depends on "[PATCH] kvm: device-assignment: Catch GSI overflow"
 v4: Fix gsi_bytes calculation noted by Sheng Yang

 hw/device-assignment.c  |    4 ++
 kvm/libkvm/kvm-common.h |    4 ++
 kvm/libkvm/libkvm.c     |   81 +++++++++++++++++++++++++++++++++++++++++------
 kvm/libkvm/libkvm.h     |   10 ++++++
 4 files changed, 86 insertions(+), 13 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Yang, Sheng May 13, 2009, 4:58 a.m. UTC | #1
On Wednesday 13 May 2009 12:41:29 Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson <alex.williamson@hp.com>
> ---

Acked.
Avi Kivity May 13, 2009, 9:47 a.m. UTC | #2
Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson <alex.williamson@hp.com>
> ---
>
>  v2: Added mutex to protect gsi bitmap
>   

Why is the mutex needed?  We already have mutex protection in qemu.

How often does the driver enable/disable the MSI (and, do you now why)?  
If it's often enough it may justify kernel support.  (We'll need this 
patch in any case for kernels without this new support).
Alex Williamson May 13, 2009, 12:28 p.m. UTC | #3
On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote:
> Alex Williamson wrote:
> > We're currently using a counter to track the most recent GSI we've
> > handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> > assignment with a driver that regularly toggles the MSI enable bit.
> > This can mean only a few minutes of usable run time.  Instead, track
> > used GSIs in a bitmap.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@hp.com>
> > ---
> >
> >  v2: Added mutex to protect gsi bitmap
> >   
> 
> Why is the mutex needed?  We already have mutex protection in qemu.

If it's unneeded, I'll happily remove it.  I was assuming in a guest
with multiple devices these could come in parallel, but maybe the guest
is already serialized for config space accesses via cfc/cf8.

> How often does the driver enable/disable the MSI (and, do you now why)?  
> If it's often enough it may justify kernel support.  (We'll need this 
> patch in any case for kernels without this new support).

Seems like multiple times per second.  I don't know why.  Now I'm
starting to get curious why nobody else seems to be hitting this.  I'm
seeing it on an e1000e NIC and Qlogic fibre channel.  Is everyone else
using MSI-X or regular interrupts vs MSI?

Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity May 13, 2009, 12:35 p.m. UTC | #4
Alex Williamson wrote:
> On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote:
>   
>> Alex Williamson wrote:
>>     
>>> We're currently using a counter to track the most recent GSI we've
>>> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
>>> assignment with a driver that regularly toggles the MSI enable bit.
>>> This can mean only a few minutes of usable run time.  Instead, track
>>> used GSIs in a bitmap.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@hp.com>
>>> ---
>>>
>>>  v2: Added mutex to protect gsi bitmap
>>>   
>>>       
>> Why is the mutex needed?  We already have mutex protection in qemu.
>>     
>
> If it's unneeded, I'll happily remove it.  I was assuming in a guest
> with multiple devices these could come in parallel, but maybe the guest
> is already serialized for config space accesses via cfc/cf8.
>   

The guest may or may not be serialized; we can't rely on that.  But qemu 
is, and we can.

>   
>> How often does the driver enable/disable the MSI (and, do you now why)?  
>> If it's often enough it may justify kernel support.  (We'll need this 
>> patch in any case for kernels without this new support).
>>     
>
> Seems like multiple times per second.  I don't know why.  Now I'm
> starting to get curious why nobody else seems to be hitting this.  I'm
> seeing it on an e1000e NIC and Qlogic fibre channel.  Is everyone else
> using MSI-X or regular interrupts vs MSI?
>   

When you say "multiple times", it is several, or a lot more?

Maybe it is NAPI?
Alex Williamson May 13, 2009, 12:55 p.m. UTC | #5
On Wed, 2009-05-13 at 15:35 +0300, Avi Kivity wrote:
> Alex Williamson wrote:
> > On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote:
> >   
> >> Alex Williamson wrote:
> >>     
> >>> We're currently using a counter to track the most recent GSI we've
> >>> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> >>> assignment with a driver that regularly toggles the MSI enable bit.
> >>> This can mean only a few minutes of usable run time.  Instead, track
> >>> used GSIs in a bitmap.
> >>>
> >>> Signed-off-by: Alex Williamson <alex.williamson@hp.com>
> >>> ---
> >>>
> >>>  v2: Added mutex to protect gsi bitmap
> >>>   
> >>>       
> >> Why is the mutex needed?  We already have mutex protection in qemu.
> >>     
> >
> > If it's unneeded, I'll happily remove it.  I was assuming in a guest
> > with multiple devices these could come in parallel, but maybe the guest
> > is already serialized for config space accesses via cfc/cf8.
> >   
> 
> The guest may or may not be serialized; we can't rely on that.  But qemu 
> is, and we can.

Ok, I'll drop the mutex here.
   
> >> How often does the driver enable/disable the MSI (and, do you now why)?  
> >> If it's often enough it may justify kernel support.  (We'll need this 
> >> patch in any case for kernels without this new support).
> >>     
> >
> > Seems like multiple times per second.  I don't know why.  Now I'm
> > starting to get curious why nobody else seems to be hitting this.  I'm
> > seeing it on an e1000e NIC and Qlogic fibre channel.  Is everyone else
> > using MSI-X or regular interrupts vs MSI?
> >   
> 
> When you say "multiple times", it is several, or a lot more?
> 
> Maybe it is NAPI?

The system would run out of the ~1000 available GSIs in a minute or two
with just an e1000e available to the guest.  So that's something on the
order of 10/s.  This also causes a printk in the host ever time the
interrupt in enabled, which can't help performance and gets pretty
annoying for syslog.  I was guessing some kind of interrupt mitigation,
such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
slower rate).

Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity May 13, 2009, 1 p.m. UTC | #6
Alex Williamson wrote:
>>
>> When you say "multiple times", it is several, or a lot more?
>>
>> Maybe it is NAPI?
>>     
>
> The system would run out of the ~1000 available GSIs in a minute or two
> with just an e1000e available to the guest.  So that's something on the
> order of 10/s.  This also causes a printk in the host ever time the
> interrupt in enabled, which can't help performance and gets pretty
> annoying for syslog.  I was guessing some kind of interrupt mitigation,
> such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
> slower rate).
>   

I see.  And what is the path by which it is disabled?  The mask bit in 
the MSI entry?
Alex Williamson May 13, 2009, 1:11 p.m. UTC | #7
On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
> Alex Williamson wrote:
> >>
> >> When you say "multiple times", it is several, or a lot more?
> >>
> >> Maybe it is NAPI?
> >>     
> >
> > The system would run out of the ~1000 available GSIs in a minute or two
> > with just an e1000e available to the guest.  So that's something on the
> > order of 10/s.  This also causes a printk in the host ever time the
> > interrupt in enabled, which can't help performance and gets pretty
> > annoying for syslog.  I was guessing some kind of interrupt mitigation,
> > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
> > slower rate).
> >   
> 
> I see.  And what is the path by which it is disabled?  The mask bit in 
> the MSI entry?

Yes, I believe the only path is via a write to the MSI capability in the
PCI config space.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 13, 2009, 1:55 p.m. UTC | #8
On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
> On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
> > Alex Williamson wrote:
> > >>
> > >> When you say "multiple times", it is several, or a lot more?
> > >>
> > >> Maybe it is NAPI?
> > >>     
> > >
> > > The system would run out of the ~1000 available GSIs in a minute or two
> > > with just an e1000e available to the guest.  So that's something on the
> > > order of 10/s.  This also causes a printk in the host ever time the
> > > interrupt in enabled, which can't help performance and gets pretty
> > > annoying for syslog.  I was guessing some kind of interrupt mitigation,
> > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
> > > slower rate).
> > >   
> > 
> > I see.  And what is the path by which it is disabled?  The mask bit in 
> > the MSI entry?
> 
> Yes, I believe the only path is via a write to the MSI capability in the
> PCI config space.
> 
> Alex

Very surprising: I haven't seen any driver disable MSI expect on device
destructor path. Is this a linux guest?
Alex Williamson May 13, 2009, 2:15 p.m. UTC | #9
On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
> On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
> > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
> > > Alex Williamson wrote:
> > > >>
> > > >> When you say "multiple times", it is several, or a lot more?
> > > >>
> > > >> Maybe it is NAPI?
> > > >>     
> > > >
> > > > The system would run out of the ~1000 available GSIs in a minute or two
> > > > with just an e1000e available to the guest.  So that's something on the
> > > > order of 10/s.  This also causes a printk in the host ever time the
> > > > interrupt in enabled, which can't help performance and gets pretty
> > > > annoying for syslog.  I was guessing some kind of interrupt mitigation,
> > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
> > > > slower rate).
> > > >   
> > > 
> > > I see.  And what is the path by which it is disabled?  The mask bit in 
> > > the MSI entry?
> > 
> > Yes, I believe the only path is via a write to the MSI capability in the
> > PCI config space.
> > 
> > Alex
> 
> Very surprising: I haven't seen any driver disable MSI expect on device
> destructor path. Is this a linux guest?

Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
upstream kernels and try to figure out why it's doing it.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 13, 2009, 2:30 p.m. UTC | #10
On Wed, May 13, 2009 at 08:15:29AM -0600, Alex Williamson wrote:
> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
> > > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
> > > > Alex Williamson wrote:
> > > > >>
> > > > >> When you say "multiple times", it is several, or a lot more?
> > > > >>
> > > > >> Maybe it is NAPI?
> > > > >>     
> > > > >
> > > > > The system would run out of the ~1000 available GSIs in a minute or two
> > > > > with just an e1000e available to the guest.  So that's something on the
> > > > > order of 10/s.  This also causes a printk in the host ever time the
> > > > > interrupt in enabled, which can't help performance and gets pretty
> > > > > annoying for syslog.  I was guessing some kind of interrupt mitigation,
> > > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
> > > > > slower rate).
> > > > >   
> > > > 
> > > > I see.  And what is the path by which it is disabled?  The mask bit in 
> > > > the MSI entry?
> > > 
> > > Yes, I believe the only path is via a write to the MSI capability in the
> > > PCI config space.
> > > 
> > > Alex
> > 
> > Very surprising: I haven't seen any driver disable MSI expect on device
> > destructor path. Is this a linux guest?
> 
> Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
> upstream kernels and try to figure out why it's doing it.
> 
> Alex

Maybe power management powers down the card? Why would that be?
Or maybe it detects some error in emulation and resets the card?
Michael S. Tsirkin May 13, 2009, 2:32 p.m. UTC | #11
On Tue, May 12, 2009 at 10:41:29PM -0600, Alex Williamson wrote:
> +	gsi_count = kvm_get_gsi_count(kvm);
> +	/* Round up so we can search ints using ffs */
> +	gsi_bytes = ((gsi_count + 31) / 32) * 4;
> +	kvm->used_gsi_bitmap = malloc(gsi_bytes);

What happens on error in kvm_get_gsi_count?
gsi_count will be negative ..
Alex Williamson May 13, 2009, 2:33 p.m. UTC | #12
On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
> > > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
> > > > Alex Williamson wrote:
> > > > >>
> > > > >> When you say "multiple times", it is several, or a lot more?
> > > > >>
> > > > >> Maybe it is NAPI?
> > > > >>     
> > > > >
> > > > > The system would run out of the ~1000 available GSIs in a minute or two
> > > > > with just an e1000e available to the guest.  So that's something on the
> > > > > order of 10/s.  This also causes a printk in the host ever time the
> > > > > interrupt in enabled, which can't help performance and gets pretty
> > > > > annoying for syslog.  I was guessing some kind of interrupt mitigation,
> > > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
> > > > > slower rate).
> > > > >   
> > > > 
> > > > I see.  And what is the path by which it is disabled?  The mask bit in 
> > > > the MSI entry?
> > > 
> > > Yes, I believe the only path is via a write to the MSI capability in the
> > > PCI config space.
> > > 
> > > Alex
> > 
> > Very surprising: I haven't seen any driver disable MSI expect on device
> > destructor path. Is this a linux guest?
> 
> Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
> upstream kernels and try to figure out why it's doing it.

Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
it's either something in older kernels or something debian introduced,
but that seems unlikely.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson May 13, 2009, 11:07 p.m. UTC | #13
On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote:
> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
> > On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
> > > Very surprising: I haven't seen any driver disable MSI expect on device
> > > destructor path. Is this a linux guest?
> > 
> > Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
> > upstream kernels and try to figure out why it's doing it.
> 
> Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
> it's either something in older kernels or something debian introduced,
> but that seems unlikely.

For the curious, this was fixed prior to 2.6.27-rc1 by this:

commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
Author: Matthew Wilcox
Date:   Fri Jul 25 15:42:58 2008 -0600

    PCI MSI: Don't disable MSIs if the mask bit isn't supported
    
    David Vrabel has a device which generates an interrupt storm on the INTx
    pin if we disable MSI interrupts altogether.  Masking interrupts is only
    a performance optimisation, so we can ignore the request to mask the
    interrupt.

It looks like without the maskbit attribute on MSI, the default way to
mask an MSI interrupt was to toggle the MSI enable bit.  This was
introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan
was probably 2.6.21 - 2.6.26.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity May 17, 2009, 8:47 p.m. UTC | #14
Alex Williamson wrote:
> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote:
>   
>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
>>     
>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
>>>       
>>>> Very surprising: I haven't seen any driver disable MSI expect on device
>>>> destructor path. Is this a linux guest?
>>>>         
>>> Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
>>> upstream kernels and try to figure out why it's doing it.
>>>       
>> Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
>> it's either something in older kernels or something debian introduced,
>> but that seems unlikely.
>>     
>
> For the curious, this was fixed prior to 2.6.27-rc1 by this:
>
> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
> Author: Matthew Wilcox
> Date:   Fri Jul 25 15:42:58 2008 -0600
>
>     PCI MSI: Don't disable MSIs if the mask bit isn't supported
>     
>     David Vrabel has a device which generates an interrupt storm on the INTx
>     pin if we disable MSI interrupts altogether.  Masking interrupts is only
>     a performance optimisation, so we can ignore the request to mask the
>     interrupt.
>
> It looks like without the maskbit attribute on MSI, the default way to
> mask an MSI interrupt was to toggle the MSI enable bit.  This was
> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan
> was probably 2.6.21 - 2.6.26.
>
>   

On the other hand, if the host device supports this maskbit attribute, 
we might want to support it.  I'm not sure exactly how though.

If we trap msi entry writes, we're inviting the guest to exit every time 
it wants to disable interrupts.  If we don't, we're inviting spurious 
interrupts, which will cause unwanted exits and injections.

Maybe we ought to let the guest write to the msi tables without 
trapping, and in the injection logic do something like

    msi_entry = *msi_entry_ptr;
    mb();
    if (msi_entry != msi->last_msi_entry)
         msi_reconfigure();
    if (msi_enabled(msi_entry))
         insert_the_needle();
Michael S. Tsirkin May 18, 2009, 11:12 a.m. UTC | #15
On Sun, May 17, 2009 at 11:47:53PM +0300, Avi Kivity wrote:
> Alex Williamson wrote:
>> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote:
>>   
>>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
>>>     
>>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
>>>>       
>>>>> Very surprising: I haven't seen any driver disable MSI expect on device
>>>>> destructor path. Is this a linux guest?
>>>>>         
>>>> Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
>>>> upstream kernels and try to figure out why it's doing it.
>>>>       
>>> Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
>>> it's either something in older kernels or something debian introduced,
>>> but that seems unlikely.
>>>     
>>
>> For the curious, this was fixed prior to 2.6.27-rc1 by this:
>>
>> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
>> Author: Matthew Wilcox
>> Date:   Fri Jul 25 15:42:58 2008 -0600
>>
>>     PCI MSI: Don't disable MSIs if the mask bit isn't supported
>>         David Vrabel has a device which generates an interrupt storm on 
>> the INTx
>>     pin if we disable MSI interrupts altogether.  Masking interrupts is only
>>     a performance optimisation, so we can ignore the request to mask the
>>     interrupt.
>>
>> It looks like without the maskbit attribute on MSI, the default way to
>> mask an MSI interrupt was to toggle the MSI enable bit.  This was
>> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan
>> was probably 2.6.21 - 2.6.26.
>>
>>   
>
> On the other hand, if the host device supports this maskbit attribute,  
> we might want to support it.  I'm not sure exactly how though.
>
> If we trap msi entry writes, we're inviting the guest to exit every time  
> it wants to disable interrupts.  If we don't, we're inviting spurious  
> interrupts, which will cause unwanted exits and injections.

Avi, I think you are mixing up MSI and MSI-X. MSI does not have any tables:
all changes go through configuration writes, which AFAIK we always trap.
Isn't that right?

On the other hand, in MSI-X mask bit is mandatory, not optional
so we'll have to support it for assigned devices at some point.

If we are worried about speed of masking/unmasking MSI-X interrupts for
assigned devices (older kernels used to mask them, recent kernels leave
this to drivers) we will probably need to have MSI-X support in the
kernel, and have kernel examine the mask bit before injecting the
interrupt, just like real devices do.

> Maybe we ought to let the guest write to the msi tables without  
> trapping, and in the injection logic do something like
>
>    msi_entry = *msi_entry_ptr;
>    mb();
>    if (msi_entry != msi->last_msi_entry)
>         msi_reconfigure();
>    if (msi_enabled(msi_entry))
>         insert_the_needle();

I don't really understand why do you need the reconfigure
and tracking last msi entry here.
Avi Kivity May 18, 2009, 11:36 a.m. UTC | #16
Michael S. Tsirkin wrote:
> On Sun, May 17, 2009 at 11:47:53PM +0300, Avi Kivity wrote:
>   
>> Alex Williamson wrote:
>>     
>>> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote:
>>>   
>>>       
>>>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
>>>>     
>>>>         
>>>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
>>>>>       
>>>>>           
>>>>>> Very surprising: I haven't seen any driver disable MSI expect on device
>>>>>> destructor path. Is this a linux guest?
>>>>>>         
>>>>>>             
>>>>> Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
>>>>> upstream kernels and try to figure out why it's doing it.
>>>>>       
>>>>>           
>>>> Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
>>>> it's either something in older kernels or something debian introduced,
>>>> but that seems unlikely.
>>>>     
>>>>         
>>> For the curious, this was fixed prior to 2.6.27-rc1 by this:
>>>
>>> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
>>> Author: Matthew Wilcox
>>> Date:   Fri Jul 25 15:42:58 2008 -0600
>>>
>>>     PCI MSI: Don't disable MSIs if the mask bit isn't supported
>>>         David Vrabel has a device which generates an interrupt storm on 
>>> the INTx
>>>     pin if we disable MSI interrupts altogether.  Masking interrupts is only
>>>     a performance optimisation, so we can ignore the request to mask the
>>>     interrupt.
>>>
>>> It looks like without the maskbit attribute on MSI, the default way to
>>> mask an MSI interrupt was to toggle the MSI enable bit.  This was
>>> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan
>>> was probably 2.6.21 - 2.6.26.
>>>
>>>   
>>>       
>> On the other hand, if the host device supports this maskbit attribute,  
>> we might want to support it.  I'm not sure exactly how though.
>>
>> If we trap msi entry writes, we're inviting the guest to exit every time  
>> it wants to disable interrupts.  If we don't, we're inviting spurious  
>> interrupts, which will cause unwanted exits and injections.
>>     
>
> Avi, I think you are mixing up MSI and MSI-X. MSI does not have any tables:
> all changes go through configuration writes, which AFAIK we always trap.
> Isn't that right?
>   

Right.

> On the other hand, in MSI-X mask bit is mandatory, not optional
> so we'll have to support it for assigned devices at some point.
>
> If we are worried about speed of masking/unmasking MSI-X interrupts for
> assigned devices (older kernels used to mask them, recent kernels leave
> this to drivers) we will probably need to have MSI-X support in the
> kernel, and have kernel examine the mask bit before injecting the
> interrupt, just like real devices do.
>   

Yes.  Let's actually quantify this though.  Several times per second 
doesn't qualify.

>> Maybe we ought to let the guest write to the msi tables without  
>> trapping, and in the injection logic do something like
>>
>>    msi_entry = *msi_entry_ptr;
>>    mb();
>>    if (msi_entry != msi->last_msi_entry)
>>         msi_reconfigure();
>>    if (msi_enabled(msi_entry))
>>         insert_the_needle();
>>     
>
> I don't really understand why do you need the reconfigure
> and tracking last msi entry here.
>   

The msi entry can change the vector and delivery mode, no?  This way we 
can cache some stuff instead of decoding it each time.  For example, if 
the entry points at a specific vcpu, we can cache the vcpu pointer, 
instead of searching for the apic ID.
Michael S. Tsirkin May 18, 2009, 12:19 p.m. UTC | #17
On Mon, May 18, 2009 at 02:36:41PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>> On Sun, May 17, 2009 at 11:47:53PM +0300, Avi Kivity wrote:
>>   
>>> Alex Williamson wrote:
>>>     
>>>> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote:
>>>>         
>>>>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
>>>>>             
>>>>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
>>>>>>                 
>>>>>>> Very surprising: I haven't seen any driver disable MSI expect on device
>>>>>>> destructor path. Is this a linux guest?
>>>>>>>                     
>>>>>> Yes, Debian 2.6.26 kernel.  I'll check it it behaves the same on newer
>>>>>> upstream kernels and try to figure out why it's doing it.
>>>>>>                 
>>>>> Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
>>>>> it's either something in older kernels or something debian introduced,
>>>>> but that seems unlikely.
>>>>>             
>>>> For the curious, this was fixed prior to 2.6.27-rc1 by this:
>>>>
>>>> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
>>>> Author: Matthew Wilcox
>>>> Date:   Fri Jul 25 15:42:58 2008 -0600
>>>>
>>>>     PCI MSI: Don't disable MSIs if the mask bit isn't supported
>>>>         David Vrabel has a device which generates an interrupt 
>>>> storm on the INTx
>>>>     pin if we disable MSI interrupts altogether.  Masking interrupts is only
>>>>     a performance optimisation, so we can ignore the request to mask the
>>>>     interrupt.
>>>>
>>>> It looks like without the maskbit attribute on MSI, the default way to
>>>> mask an MSI interrupt was to toggle the MSI enable bit.  This was
>>>> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan
>>>> was probably 2.6.21 - 2.6.26.
>>>>
>>>>         
>>> On the other hand, if the host device supports this maskbit 
>>> attribute,  we might want to support it.  I'm not sure exactly how 
>>> though.
>>>
>>> If we trap msi entry writes, we're inviting the guest to exit every 
>>> time  it wants to disable interrupts.  If we don't, we're inviting 
>>> spurious  interrupts, which will cause unwanted exits and injections.
>>>     
>>
>> Avi, I think you are mixing up MSI and MSI-X. MSI does not have any tables:
>> all changes go through configuration writes, which AFAIK we always trap.
>> Isn't that right?
>>   
>
> Right.
>
>> On the other hand, in MSI-X mask bit is mandatory, not optional
>> so we'll have to support it for assigned devices at some point.
>>
>> If we are worried about speed of masking/unmasking MSI-X interrupts for
>> assigned devices (older kernels used to mask them, recent kernels leave
>> this to drivers) we will probably need to have MSI-X support in the
>> kernel, and have kernel examine the mask bit before injecting the
>> interrupt, just like real devices do.
>>   
>
> Yes.

Actually, if we do that, we'll need to address a race where a driver
has updated the mask bit in the window after we tested it
and before we inject the interrupt. Not sure how to do this.

> Let's actually quantify this though.  Several times per second  
> doesn't qualify.

Oh, I didn't actually imply that we need to optimize this path.
You seemed worried about the speed of masking the interrupt,
and I commented that to optimize it we'll have to move it
to kernel.

>>> Maybe we ought to let the guest write to the msi tables without   
>>> trapping, and in the injection logic do something like
>>>
>>>    msi_entry = *msi_entry_ptr;
>>>    mb();
>>>    if (msi_entry != msi->last_msi_entry)
>>>         msi_reconfigure();
>>>    if (msi_enabled(msi_entry))
>>>         insert_the_needle();
>>>     
>>
>> I don't really understand why do you need the reconfigure
>> and tracking last msi entry here.
>>   
>
> The msi entry can change the vector and delivery mode, no?  This way we  
> can cache some stuff instead of decoding it each time.  For example, if  
> the entry points at a specific vcpu, we can cache the vcpu pointer,  
> instead of searching for the apic ID.
Avi Kivity May 18, 2009, 12:33 p.m. UTC | #18
Michael S. Tsirkin wrote:
>>
>>> On the other hand, in MSI-X mask bit is mandatory, not optional
>>> so we'll have to support it for assigned devices at some point.
>>>
>>> If we are worried about speed of masking/unmasking MSI-X interrupts for
>>> assigned devices (older kernels used to mask them, recent kernels leave
>>> this to drivers) we will probably need to have MSI-X support in the
>>> kernel, and have kernel examine the mask bit before injecting the
>>> interrupt, just like real devices do.
>>>   
>>>       
>> Yes.
>>     
>
> Actually, if we do that, we'll need to address a race where a driver
> has updated the mask bit in the window after we tested it
> and before we inject the interrupt. Not sure how to do this.
>   

The driver can't tell if the interrupt came first, so it's a valid race 
(real hardware has the same race).
Michael S. Tsirkin May 18, 2009, 1:45 p.m. UTC | #19
On Mon, May 18, 2009 at 03:33:20PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>>>
>>>> On the other hand, in MSI-X mask bit is mandatory, not optional
>>>> so we'll have to support it for assigned devices at some point.
>>>>
>>>> If we are worried about speed of masking/unmasking MSI-X interrupts for
>>>> assigned devices (older kernels used to mask them, recent kernels leave
>>>> this to drivers) we will probably need to have MSI-X support in the
>>>> kernel, and have kernel examine the mask bit before injecting the
>>>> interrupt, just like real devices do.
>>>>         
>>> Yes.
>>>     
>>
>> Actually, if we do that, we'll need to address a race where a driver
>> has updated the mask bit in the window after we tested it
>> and before we inject the interrupt. Not sure how to do this.
>>   
>
> The driver can't tell if the interrupt came first, so it's a valid race  
> (real hardware has the same race).

The driver for real device can do a read followed by sync_irq to flush
out interrupts.
Avi Kivity May 18, 2009, 1:55 p.m. UTC | #20
Michael S. Tsirkin wrote:
> On Mon, May 18, 2009 at 03:33:20PM +0300, Avi Kivity wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>>>> On the other hand, in MSI-X mask bit is mandatory, not optional
>>>>> so we'll have to support it for assigned devices at some point.
>>>>>
>>>>> If we are worried about speed of masking/unmasking MSI-X interrupts for
>>>>> assigned devices (older kernels used to mask them, recent kernels leave
>>>>> this to drivers) we will probably need to have MSI-X support in the
>>>>> kernel, and have kernel examine the mask bit before injecting the
>>>>> interrupt, just like real devices do.
>>>>>         
>>>>>           
>>>> Yes.
>>>>     
>>>>         
>>> Actually, if we do that, we'll need to address a race where a driver
>>> has updated the mask bit in the window after we tested it
>>> and before we inject the interrupt. Not sure how to do this.
>>>   
>>>       
>> The driver can't tell if the interrupt came first, so it's a valid race  
>> (real hardware has the same race).
>>     
>
> The driver for real device can do a read followed by sync_irq to flush
> out interrupts.
>   

If it generates the interrupt after masking it in the msi-x entry, we'll 
see it.  If it generates the interrupt before masking it, it may or may 
not receive the interrupt, even on real hardware.
Michael S. Tsirkin May 18, 2009, 2:40 p.m. UTC | #21
On Mon, May 18, 2009 at 04:55:30PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>> On Mon, May 18, 2009 at 03:33:20PM +0300, Avi Kivity wrote:
>>   
>>> Michael S. Tsirkin wrote:
>>>     
>>>>>> On the other hand, in MSI-X mask bit is mandatory, not optional
>>>>>> so we'll have to support it for assigned devices at some point.
>>>>>>
>>>>>> If we are worried about speed of masking/unmasking MSI-X interrupts for
>>>>>> assigned devices (older kernels used to mask them, recent kernels leave
>>>>>> this to drivers) we will probably need to have MSI-X support in the
>>>>>> kernel, and have kernel examine the mask bit before injecting the
>>>>>> interrupt, just like real devices do.
>>>>>>                   
>>>>> Yes.
>>>>>             
>>>> Actually, if we do that, we'll need to address a race where a driver
>>>> has updated the mask bit in the window after we tested it
>>>> and before we inject the interrupt. Not sure how to do this.
>>>>         
>>> The driver can't tell if the interrupt came first, so it's a valid 
>>> race  (real hardware has the same race).
>>>     
>>
>> The driver for real device can do a read followed by sync_irq to flush
>> out interrupts.
>>   
>
> If it generates the interrupt after masking it in the msi-x entry, we'll  
> see it.  If it generates the interrupt before masking it, it may or may  
> not receive the interrupt, even on real hardware.

Yes but in the later case, real hardware must re-send the pending
interrupt after it is unmasked (that's the spec). We would just lose it.
Avi Kivity May 18, 2009, 2:46 p.m. UTC | #22
Michael S. Tsirkin wrote:
>> If it generates the interrupt after masking it in the msi-x entry, we'll  
>> see it.  If it generates the interrupt before masking it, it may or may  
>> not receive the interrupt, even on real hardware.
>>     
>
> Yes but in the later case, real hardware must re-send the pending
> interrupt after it is unmasked (that's the spec). We would just lose it.
>   

That's a different matter.  We need to buffer the interrupt pending bit, 
and a way for userspace to either query that buffer or have a 
conditional injection (inject_if_pending).
Michael S. Tsirkin May 18, 2009, 3:01 p.m. UTC | #23
On Mon, May 18, 2009 at 05:46:09PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>>> If it generates the interrupt after masking it in the msi-x entry, 
>>> we'll  see it.  If it generates the interrupt before masking it, it 
>>> may or may  not receive the interrupt, even on real hardware.
>>>     
>>
>> Yes but in the later case, real hardware must re-send the pending
>> interrupt after it is unmasked (that's the spec). We would just lose it.
>>   
>
> That's a different matter.  We need to buffer the interrupt pending bit,  
> and a way for userspace to either query that buffer or have a  
> conditional injection (inject_if_pending).

Here's the race as I see it: we discussed the possibility
of making kernel and user share and actual memory page,
and using that for MSI-X tables.

	host kernel want to send msi x message
	host kernel test mask bit: unmasked
	guest sets mask bit
	guest does read to flash msi writes
	guest does sync irq and makes sure there are no
                           outstanging interrupts
	---> at this stage guest expects not to get interrupts
	guest starts editing msix entry

	host kernel never saw mask so it sends message to the old address
	       or even a corrupted address which the guest is in
               the middle of editing
	bad things happen

This race is not easy to solve, except by catching writes to msix table,
and syncronising them with interrupt delivery.
Avi Kivity May 18, 2009, 3:08 p.m. UTC | #24
Michael S. Tsirkin wrote:
> Here's the race as I see it: we discussed the possibility
> of making kernel and user share and actual memory page,
> and using that for MSI-X tables.
>
> 	host kernel want to send msi x message
> 	host kernel test mask bit: unmasked
> 	guest sets mask bit
> 	guest does read to flash msi writes
> 	guest does sync irq and makes sure there are no
>                            outstanging interrupts
> 	---> at this stage guest expects not to get interrupts
> 	guest starts editing msix entry
>
> 	host kernel never saw mask so it sends message to the old address
> 	       or even a corrupted address which the guest is in
>                the middle of editing
> 	bad things happen
>
> This race is not easy to solve, except by catching writes to msix table,
> and syncronising them with interrupt delivery.
>   

You're right of course.  In any case this is premature, we'll have to 
see if this happens with any frequency.
diff mbox

Patch

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a7365c8..a6cc9b9 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -561,8 +561,10 @@  static void free_dev_irq_entries(AssignedDevice *dev)
 {
     int i;
 
-    for (i = 0; i < dev->irq_entries_nr; i++)
+    for (i = 0; i < dev->irq_entries_nr; i++) {
         kvm_del_routing_entry(kvm_context, &dev->entry[i]);
+        kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
+    }
     free(dev->entry);
     dev->entry = NULL;
     dev->irq_entries_nr = 0;
diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
index 591fb53..4b3cb51 100644
--- a/kvm/libkvm/kvm-common.h
+++ b/kvm/libkvm/kvm-common.h
@@ -66,8 +66,10 @@  struct kvm_context {
 #ifdef KVM_CAP_IRQ_ROUTING
 	struct kvm_irq_routing *irq_routes;
 	int nr_allocated_irq_routes;
+	void *used_gsi_bitmap;
+	int max_gsi;
+	pthread_mutex_t gsi_mutex;
 #endif
-	int max_used_gsi;
 };
 
 int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index ba0a5d1..3eaa120 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -35,6 +35,7 @@ 
 #include <errno.h>
 #include <sys/ioctl.h>
 #include <inttypes.h>
+#include <pthread.h>
 #include "libkvm.h"
 
 #if defined(__x86_64__) || defined(__i386__)
@@ -65,6 +66,8 @@ 
 int kvm_abi = EXPECTED_KVM_API_VERSION;
 int kvm_page_size;
 
+static inline void set_bit(uint32_t *buf, unsigned int bit);
+
 struct slot_info {
 	unsigned long phys_addr;
 	unsigned long len;
@@ -286,6 +289,9 @@  kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 	int fd;
 	kvm_context_t kvm;
 	int r;
+#ifdef KVM_CAP_IRQ_ROUTING
+	int gsi_count, gsi_bytes, i;
+#endif
 
 	fd = open("/dev/kvm", O_RDWR);
 	if (fd == -1) {
@@ -322,6 +328,25 @@  kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 	kvm->dirty_pages_log_all = 0;
 	kvm->no_irqchip_creation = 0;
 	kvm->no_pit_creation = 0;
+#ifdef KVM_CAP_IRQ_ROUTING
+	pthread_mutex_init(&kvm->gsi_mutex, NULL);
+
+	gsi_count = kvm_get_gsi_count(kvm);
+	/* Round up so we can search ints using ffs */
+	gsi_bytes = ((gsi_count + 31) / 32) * 4;
+	kvm->used_gsi_bitmap = malloc(gsi_bytes);
+	if (!kvm->used_gsi_bitmap)
+		goto out_close;
+	memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
+	kvm->max_gsi = gsi_bytes * 8;
+
+	/* Mark all the IOAPIC pin GSIs and any over-allocated
+	 * GSIs as already in use. */
+	for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
+		set_bit(kvm->used_gsi_bitmap, i);
+	for (i = gsi_count; i < kvm->max_gsi; i++)
+		set_bit(kvm->used_gsi_bitmap, i);
+#endif
 
 	return kvm;
  out_close:
@@ -1298,8 +1323,6 @@  int kvm_add_routing_entry(kvm_context_t kvm,
 	new->flags = entry->flags;
 	new->u = entry->u;
 
-	if (entry->gsi > kvm->max_used_gsi)
-		kvm->max_used_gsi = entry->gsi;
 	return 0;
 #else
 	return -ENOSYS;
@@ -1404,18 +1427,54 @@  int kvm_commit_irq_routes(kvm_context_t kvm)
 #endif
 }
 
+#ifdef KVM_CAP_IRQ_ROUTING
+static inline void set_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] |= 1U << (bit % 32);
+}
+
+static inline void clear_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] &= ~(1U << (bit % 32));
+}
+
+static int kvm_find_free_gsi(kvm_context_t kvm)
+{
+	int i, bit, gsi;
+	uint32_t *buf = kvm->used_gsi_bitmap;
+
+	for (i = 0; i < kvm->max_gsi / 32; i++) {
+		bit = ffs(~buf[i]);
+		if (!bit)
+			continue;
+
+		gsi = bit - 1 + i * 32;
+		set_bit(buf, gsi);
+		return gsi;
+	}
+
+	return -ENOSPC;
+}
+#endif
+
 int kvm_get_irq_route_gsi(kvm_context_t kvm)
 {
+	int gsi = -ENOSYS;
+
 #ifdef KVM_CAP_IRQ_ROUTING
-	if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS)  {
-	    if (kvm->max_used_gsi <= kvm_get_gsi_count(kvm))
-                return kvm->max_used_gsi + 1;
-            else
-                return -ENOSPC;
-        } else
-            return KVM_IOAPIC_NUM_PINS;
-#else
-	return -ENOSYS;
+	pthread_mutex_lock(&kvm->gsi_mutex);
+	gsi = kvm_find_free_gsi(kvm);
+	pthread_mutex_unlock(&kvm->gsi_mutex);
+#endif
+	return gsi;
+}
+ 
+void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi)
+{
+#ifdef KVM_CAP_IRQ_ROUTING
+	pthread_mutex_lock(&kvm->gsi_mutex);
+	clear_bit(kvm->used_gsi_bitmap, gsi);
+	pthread_mutex_unlock(&kvm->gsi_mutex);
 #endif
 }
 
diff --git a/kvm/libkvm/libkvm.h b/kvm/libkvm/libkvm.h
index 4821a1e..845bb8a 100644
--- a/kvm/libkvm/libkvm.h
+++ b/kvm/libkvm/libkvm.h
@@ -856,6 +856,16 @@  int kvm_commit_irq_routes(kvm_context_t kvm);
  */
 int kvm_get_irq_route_gsi(kvm_context_t kvm);
 
+/*!
+ * \brief Free used GSI number
+ *
+ * Free used GSI number acquired from kvm_get_irq_route_gsi()
+ *
+ * \param kvm Pointer to the current kvm_context
+ * \param gsi GSI number to free
+ */
+void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi);
+
 #ifdef KVM_CAP_DEVICE_MSIX
 int kvm_assign_set_msix_nr(kvm_context_t kvm,
 			   struct kvm_assigned_msix_nr *msix_nr);