Message ID | 20090513043835.6696.27384.stgit@dl380g6-3.ned.telco.ned.telco (mailing list archive) |
---|---|
State | New, archived |
On Wednesday 13 May 2009 12:41:29 Alex Williamson wrote: > We're currently using a counter to track the most recent GSI we've > handed out. This quickly hits KVM_MAX_IRQ_ROUTES when using device > assignment with a driver that regularly toggles the MSI enable bit. > This can mean only a few minutes of usable run time. Instead, track > used GSIs in a bitmap. > > Signed-off-by: Alex Williamson <alex.williamson@hp.com> > --- Acked.
Alex Williamson wrote: > We're currently using a counter to track the most recent GSI we've > handed out. This quickly hits KVM_MAX_IRQ_ROUTES when using device > assignment with a driver that regularly toggles the MSI enable bit. > This can mean only a few minutes of usable run time. Instead, track > used GSIs in a bitmap. > > Signed-off-by: Alex Williamson <alex.williamson@hp.com> > --- > > v2: Added mutex to protect gsi bitmap > Why is the mutex needed? We already have mutex protection in qemu. How often does the driver enable/disable the MSI (and, do you know why)? If it's often enough it may justify kernel support. (We'll need this patch in any case for kernels without this new support).
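[Editor's sketch: the bitmap scheme the patch describes can be illustrated in C. The helper names, the fixed GSI_COUNT, and the allocator structure below are illustrative, not the actual qemu-kvm code.]

```c
/* Hedged sketch of bitmap-based GSI tracking: instead of a
 * monotonically increasing counter (which exhausts KVM_MAX_IRQ_ROUTES
 * when a driver toggles MSI enable), allocate the lowest free GSI
 * from a bitmap and return it on free. */
#include <stdint.h>
#include <strings.h>   /* ffs() */

#define GSI_COUNT 1024                      /* ~1000 GSIs mentioned in the thread */
#define GSI_WORDS ((GSI_COUNT + 31) / 32)   /* round up so ffs() can scan ints */

static uint32_t used_gsi_bitmap[GSI_WORDS];

/* Return the lowest free GSI and mark it used, or -1 if exhausted. */
static int gsi_alloc(void)
{
    for (int i = 0; i < GSI_WORDS; i++) {
        int bit = ffs((int)~used_gsi_bitmap[i]);  /* first zero bit, 1-based */
        if (bit) {
            used_gsi_bitmap[i] |= 1u << (bit - 1);
            return i * 32 + bit - 1;
        }
    }
    return -1;
}

/* Release a GSI, so toggling the MSI enable bit no longer leaks routes. */
static void gsi_free(int gsi)
{
    used_gsi_bitmap[gsi / 32] &= ~(1u << (gsi % 32));
}
```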
On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote: > Alex Williamson wrote: > > We're currently using a counter to track the most recent GSI we've > > handed out. This quickly hits KVM_MAX_IRQ_ROUTES when using device > > assignment with a driver that regularly toggles the MSI enable bit. > > This can mean only a few minutes of usable run time. Instead, track > > used GSIs in a bitmap. > > > > Signed-off-by: Alex Williamson <alex.williamson@hp.com> > > --- > > > > v2: Added mutex to protect gsi bitmap > > > > Why is the mutex needed? We already have mutex protection in qemu. If it's unneeded, I'll happily remove it. I was assuming in a guest with multiple devices these could come in parallel, but maybe the guest is already serialized for config space accesses via cfc/cf8. > How often does the driver enable/disable the MSI (and, do you now why)? > If it's often enough it may justify kernel support. (We'll need this > patch in any case for kernels without this new support). Seems like multiple times per second. I don't know why. Now I'm starting to get curious why nobody else seems to be hitting this. I'm seeing it on an e1000e NIC and Qlogic fibre channel. Is everyone else using MSI-X or regular interrupts vs MSI? Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Alex Williamson wrote: > On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote: > >> Alex Williamson wrote: >> >>> We're currently using a counter to track the most recent GSI we've >>> handed out. This quickly hits KVM_MAX_IRQ_ROUTES when using device >>> assignment with a driver that regularly toggles the MSI enable bit. >>> This can mean only a few minutes of usable run time. Instead, track >>> used GSIs in a bitmap. >>> >>> Signed-off-by: Alex Williamson <alex.williamson@hp.com> >>> --- >>> >>> v2: Added mutex to protect gsi bitmap >>> >>> >> Why is the mutex needed? We already have mutex protection in qemu. >> > > If it's unneeded, I'll happily remove it. I was assuming in a guest > with multiple devices these could come in parallel, but maybe the guest > is already serialized for config space accesses via cfc/cf8. > The guest may or may not be serialized; we can't rely on that. But qemu is, and we can. > >> How often does the driver enable/disable the MSI (and, do you now why)? >> If it's often enough it may justify kernel support. (We'll need this >> patch in any case for kernels without this new support). >> > > Seems like multiple times per second. I don't know why. Now I'm > starting to get curious why nobody else seems to be hitting this. I'm > seeing it on an e1000e NIC and Qlogic fibre channel. Is everyone else > using MSI-X or regular interrupts vs MSI? > When you say "multiple times", is it several, or a lot more? Maybe it is NAPI?
On Wed, 2009-05-13 at 15:35 +0300, Avi Kivity wrote: > Alex Williamson wrote: > > On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote: > > > >> Alex Williamson wrote: > >> > >>> We're currently using a counter to track the most recent GSI we've > >>> handed out. This quickly hits KVM_MAX_IRQ_ROUTES when using device > >>> assignment with a driver that regularly toggles the MSI enable bit. > >>> This can mean only a few minutes of usable run time. Instead, track > >>> used GSIs in a bitmap. > >>> > >>> Signed-off-by: Alex Williamson <alex.williamson@hp.com> > >>> --- > >>> > >>> v2: Added mutex to protect gsi bitmap > >>> > >>> > >> Why is the mutex needed? We already have mutex protection in qemu. > >> > > > > If it's unneeded, I'll happily remove it. I was assuming in a guest > > with multiple devices these could come in parallel, but maybe the guest > > is already serialized for config space accesses via cfc/cf8. > > > > The guest may or may not be serialized; we can't rely on that. But qemu > is, and we can. Ok, I'll drop the mutex here. > >> How often does the driver enable/disable the MSI (and, do you now why)? > >> If it's often enough it may justify kernel support. (We'll need this > >> patch in any case for kernels without this new support). > >> > > > > Seems like multiple times per second. I don't know why. Now I'm > > starting to get curious why nobody else seems to be hitting this. I'm > > seeing it on an e1000e NIC and Qlogic fibre channel. Is everyone else > > using MSI-X or regular interrupts vs MSI? > > > > When you say "multiple times", it is several, or a lot more? > > Maybe it is NAPI? The system would run out of the ~1000 available GSIs in a minute or two with just an e1000e available to the guest. So that's something on the order of 10/s. This also causes a printk in the host every time the interrupt is enabled, which can't help performance and gets pretty annoying for syslog.
I was guessing some kind of interrupt mitigation, such as NAPI, but a qlogic FC card seems to do it too (seemingly at a slower rate). Alex
Alex Williamson wrote: >> >> When you say "multiple times", it is several, or a lot more? >> >> Maybe it is NAPI? >> > > The system would run out of the ~1000 available GSIs in a minute or two > with just an e1000e available to the guest. So that's something on the > order of 10/s. This also causes a printk in the host ever time the > interrupt in enabled, which can't help performance and gets pretty > annoying for syslog. I was guessing some kind of interrupt mitigation, > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a > slower rate). > I see. And what is the path by which it is disabled? The mask bit in the MSI entry?
On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote: > Alex Williamson wrote: > >> > >> When you say "multiple times", it is several, or a lot more? > >> > >> Maybe it is NAPI? > >> > > > > The system would run out of the ~1000 available GSIs in a minute or two > > with just an e1000e available to the guest. So that's something on the > > order of 10/s. This also causes a printk in the host ever time the > > interrupt in enabled, which can't help performance and gets pretty > > annoying for syslog. I was guessing some kind of interrupt mitigation, > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a > > slower rate). > > > > I see. And what is the path by which it is disabled? The mask bit in > the MSI entry? Yes, I believe the only path is via a write to the MSI capability in the PCI config space. Alex
On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote: > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote: > > Alex Williamson wrote: > > >> > > >> When you say "multiple times", it is several, or a lot more? > > >> > > >> Maybe it is NAPI? > > >> > > > > > > The system would run out of the ~1000 available GSIs in a minute or two > > > with just an e1000e available to the guest. So that's something on the > > > order of 10/s. This also causes a printk in the host ever time the > > > interrupt in enabled, which can't help performance and gets pretty > > > annoying for syslog. I was guessing some kind of interrupt mitigation, > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a > > > slower rate). > > > > > > > I see. And what is the path by which it is disabled? The mask bit in > > the MSI entry? > > Yes, I believe the only path is via a write to the MSI capability in the > PCI config space. > > Alex Very surprising: I haven't seen any driver disable MSI except on the device destructor path. Is this a Linux guest?
On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: > On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote: > > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote: > > > Alex Williamson wrote: > > > >> > > > >> When you say "multiple times", it is several, or a lot more? > > > >> > > > >> Maybe it is NAPI? > > > >> > > > > > > > > The system would run out of the ~1000 available GSIs in a minute or two > > > > with just an e1000e available to the guest. So that's something on the > > > > order of 10/s. This also causes a printk in the host ever time the > > > > interrupt in enabled, which can't help performance and gets pretty > > > > annoying for syslog. I was guessing some kind of interrupt mitigation, > > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a > > > > slower rate). > > > > > > > > > > I see. And what is the path by which it is disabled? The mask bit in > > > the MSI entry? > > > > Yes, I believe the only path is via a write to the MSI capability in the > > PCI config space. > > > > Alex > > Very surprising: I haven't seen any driver disable MSI expect on device > destructor path. Is this a linux guest? Yes, Debian 2.6.26 kernel. I'll check if it behaves the same on newer upstream kernels and try to figure out why it's doing it. Alex
On Wed, May 13, 2009 at 08:15:29AM -0600, Alex Williamson wrote: > On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: > > On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote: > > > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote: > > > > Alex Williamson wrote: > > > > >> > > > > >> When you say "multiple times", it is several, or a lot more? > > > > >> > > > > >> Maybe it is NAPI? > > > > >> > > > > > > > > > > The system would run out of the ~1000 available GSIs in a minute or two > > > > > with just an e1000e available to the guest. So that's something on the > > > > > order of 10/s. This also causes a printk in the host ever time the > > > > > interrupt in enabled, which can't help performance and gets pretty > > > > > annoying for syslog. I was guessing some kind of interrupt mitigation, > > > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a > > > > > slower rate). > > > > > > > > > > > > > I see. And what is the path by which it is disabled? The mask bit in > > > > the MSI entry? > > > > > > Yes, I believe the only path is via a write to the MSI capability in the > > > PCI config space. > > > > > > Alex > > > > Very surprising: I haven't seen any driver disable MSI expect on device > > destructor path. Is this a linux guest? > > Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer > upstream kernels and try to figure out why it's doing it. > > Alex Maybe power management powers down the card? Why would that be? Or maybe it detects some error in emulation and resets the card?
On Tue, May 12, 2009 at 10:41:29PM -0600, Alex Williamson wrote: > + gsi_count = kvm_get_gsi_count(kvm); > + /* Round up so we can search ints using ffs */ > + gsi_bytes = ((gsi_count + 31) / 32) * 4; > + kvm->used_gsi_bitmap = malloc(gsi_bytes); What happens on error in kvm_get_gsi_count? gsi_count will be negative ..
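[Editor's sketch: the review point above — a negative return from kvm_get_gsi_count would feed a bogus size into the rounding arithmetic — can be guarded with an explicit check before computing the byte count. The stub below stands in for kvm_get_gsi_count, since the real call needs a kvm context; all names are illustrative.]

```c
#include <stdlib.h>

/* Hypothetical stand-in for kvm_get_gsi_count(); may return a
 * negative errno-style value on failure. */
static int kvm_get_gsi_count_stub(int fail)
{
    return fail ? -1 : 1024;
}

/* Allocate the used-GSI bitmap, rounding up to whole 32-bit words so
 * the allocator can search ints using ffs(). Returns NULL on error
 * instead of letting a negative count produce a garbage size. */
static void *alloc_gsi_bitmap(int gsi_count, size_t *bytes_out)
{
    if (gsi_count <= 0)          /* guard: don't turn an error into a size */
        return NULL;
    size_t gsi_bytes = ((size_t)(gsi_count + 31) / 32) * 4;
    *bytes_out = gsi_bytes;
    return calloc(1, gsi_bytes); /* zeroed: no GSIs marked used yet */
}
```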
On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote: > On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: > > On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote: > > > On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote: > > > > Alex Williamson wrote: > > > > >> > > > > >> When you say "multiple times", it is several, or a lot more? > > > > >> > > > > >> Maybe it is NAPI? > > > > >> > > > > > > > > > > The system would run out of the ~1000 available GSIs in a minute or two > > > > > with just an e1000e available to the guest. So that's something on the > > > > > order of 10/s. This also causes a printk in the host ever time the > > > > > interrupt in enabled, which can't help performance and gets pretty > > > > > annoying for syslog. I was guessing some kind of interrupt mitigation, > > > > > such as NAPI, but a qlogic FC card seems to do it too (seemingly at a > > > > > slower rate). > > > > > > > > > > > > > I see. And what is the path by which it is disabled? The mask bit in > > > > the MSI entry? > > > > > > Yes, I believe the only path is via a write to the MSI capability in the > > > PCI config space. > > > > > > Alex > > > > Very surprising: I haven't seen any driver disable MSI expect on device > > destructor path. Is this a linux guest? > > Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer > upstream kernels and try to figure out why it's doing it. Updating the guest to 2.6.29 seems to fix the interrupt toggling. So it's either something in older kernels or something debian introduced, but that seems unlikely. Alex
On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote: > On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote: > > On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: > > > Very surprising: I haven't seen any driver disable MSI expect on device > > > destructor path. Is this a linux guest? > > > > Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer > > upstream kernels and try to figure out why it's doing it. > > Updating the guest to 2.6.29 seems to fix the interrupt toggling. So > it's either something in older kernels or something debian introduced, > but that seems unlikely. For the curious, this was fixed prior to 2.6.27-rc1 by this: commit ce6fce4295ba727b36fdc73040e444bd1aae64cd Author: Matthew Wilcox Date: Fri Jul 25 15:42:58 2008 -0600 PCI MSI: Don't disable MSIs if the mask bit isn't supported David Vrabel has a device which generates an interrupt storm on the INTx pin if we disable MSI interrupts altogether. Masking interrupts is only a performance optimisation, so we can ignore the request to mask the interrupt. It looks like without the maskbit attribute on MSI, the default way to mask an MSI interrupt was to toggle the MSI enable bit. This was introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so its lifespan was probably 2.6.21 - 2.6.26. Alex
Alex Williamson wrote: > On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote: > >> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote: >> >>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: >>> >>>> Very surprising: I haven't seen any driver disable MSI expect on device >>>> destructor path. Is this a linux guest? >>>> >>> Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer >>> upstream kernels and try to figure out why it's doing it. >>> >> Updating the guest to 2.6.29 seems to fix the interrupt toggling. So >> it's either something in older kernels or something debian introduced, >> but that seems unlikely. >> > > For the curious, this was fixed prior to 2.6.27-rc1 by this: > > commit ce6fce4295ba727b36fdc73040e444bd1aae64cd > Author: Matthew Wilcox > Date: Fri Jul 25 15:42:58 2008 -0600 > > PCI MSI: Don't disable MSIs if the mask bit isn't supported > > David Vrabel has a device which generates an interrupt storm on the INTx > pin if we disable MSI interrupts altogether. Masking interrupts is only > a performance optimisation, so we can ignore the request to mask the > interrupt. > > It looks like without the maskbit attribute on MSI, the default way to > mask an MSI interrupt was to toggle the MSI enable bit. This was > introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan > was probably 2.6.21 - 2.6.26. > > On the other hand, if the host device supports this maskbit attribute, we might want to support it. I'm not sure exactly how though. If we trap msi entry writes, we're inviting the guest to exit every time it wants to disable interrupts. If we don't, we're inviting spurious interrupts, which will cause unwanted exits and injections. Maybe we ought to let the guest write to the msi tables without trapping, and in the injection logic do something like msi_entry = *msi_entry_ptr; mb(); if (msi_entry != msi->last_msi_entry) msi_reconfigure(); if (msi_enabled(msi_entry)) insert_the_needle();
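[Editor's sketch: Avi's injection-path pseudocode can be fleshed out as compilable C. All structure names, the entry layout, and the counters standing in for the real reconfigure/inject actions are illustrative, not actual kvm code.]

```c
#include <stdint.h>
#include <string.h>

/* Illustrative MSI entry layout (16 bytes, no padding). */
struct msi_entry { uint64_t addr; uint32_t data; uint32_t ctrl; };
#define MSI_CTRL_ENABLE 1u

struct msi_state {
    struct msi_entry *entry_ptr;  /* guest-writable entry, left untrapped */
    struct msi_entry last;        /* cached copy, to detect reconfiguration */
    int reconfigures;             /* counters stand in for the real actions */
    int injections;
};

static void msi_reconfigure(struct msi_state *s, struct msi_entry e)
{
    /* e.g. re-resolve the cached destination vcpu from the new address/data */
    s->last = e;
    s->reconfigures++;
}

/* Injection path: snapshot the entry, reconfigure if it changed since
 * last time, and only inject when the enable bit is set. */
static void msi_maybe_inject(struct msi_state *s)
{
    struct msi_entry e = *s->entry_ptr;   /* msi_entry = *msi_entry_ptr; */
    __sync_synchronize();                 /* mb(); */
    if (memcmp(&e, &s->last, sizeof e))   /* entry changed under us? */
        msi_reconfigure(s, e);
    if (e.ctrl & MSI_CTRL_ENABLE)
        s->injections++;                  /* insert_the_needle(); */
}
```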
On Sun, May 17, 2009 at 11:47:53PM +0300, Avi Kivity wrote: > Alex Williamson wrote: >> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote: >> >>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote: >>> >>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: >>>> >>>>> Very surprising: I haven't seen any driver disable MSI expect on device >>>>> destructor path. Is this a linux guest? >>>>> >>>> Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer >>>> upstream kernels and try to figure out why it's doing it. >>>> >>> Updating the guest to 2.6.29 seems to fix the interrupt toggling. So >>> it's either something in older kernels or something debian introduced, >>> but that seems unlikely. >>> >> >> For the curious, this was fixed prior to 2.6.27-rc1 by this: >> >> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd >> Author: Matthew Wilcox >> Date: Fri Jul 25 15:42:58 2008 -0600 >> >> PCI MSI: Don't disable MSIs if the mask bit isn't supported >> David Vrabel has a device which generates an interrupt storm on >> the INTx >> pin if we disable MSI interrupts altogether. Masking interrupts is only >> a performance optimisation, so we can ignore the request to mask the >> interrupt. >> >> It looks like without the maskbit attribute on MSI, the default way to >> mask an MSI interrupt was to toggle the MSI enable bit. This was >> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan >> was probably 2.6.21 - 2.6.26. >> >> > > On the other hand, if the host device supports this maskbit attribute, > we might want to support it. I'm not sure exactly how though. > > If we trap msi entry writes, we're inviting the guest to exit every time > it wants to disable interrupts. If we don't, we're inviting spurious > interrupts, which will cause unwanted exits and injections. Avi, I think you are mixing up MSI and MSI-X. MSI does not have any tables: all changes go through configuration writes, which AFAIK we always trap. 
Isn't that right? On the other hand, in MSI-X the mask bit is mandatory, not optional, so we'll have to support it for assigned devices at some point. If we are worried about speed of masking/unmasking MSI-X interrupts for assigned devices (older kernels used to mask them, recent kernels leave this to drivers) we will probably need to have MSI-X support in the kernel, and have the kernel examine the mask bit before injecting the interrupt, just like real devices do. > Maybe we ought to let the guest write to the msi tables without > trapping, and in the injection logic do something like > > msi_entry = *msi_entry_ptr; > mb(); > if (msi_entry != msi->last_msi_entry) > msi_reconfigure(); > if (msi_enabled(msi_entry)) > insert_the_needle(); I don't really understand why you need the reconfigure and the last-msi-entry tracking here.
Michael S. Tsirkin wrote: > On Sun, May 17, 2009 at 11:47:53PM +0300, Avi Kivity wrote: > >> Alex Williamson wrote: >> >>> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote: >>> >>> >>>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote: >>>> >>>> >>>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: >>>>> >>>>> >>>>>> Very surprising: I haven't seen any driver disable MSI expect on device >>>>>> destructor path. Is this a linux guest? >>>>>> >>>>>> >>>>> Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer >>>>> upstream kernels and try to figure out why it's doing it. >>>>> >>>>> >>>> Updating the guest to 2.6.29 seems to fix the interrupt toggling. So >>>> it's either something in older kernels or something debian introduced, >>>> but that seems unlikely. >>>> >>>> >>> For the curious, this was fixed prior to 2.6.27-rc1 by this: >>> >>> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd >>> Author: Matthew Wilcox >>> Date: Fri Jul 25 15:42:58 2008 -0600 >>> >>> PCI MSI: Don't disable MSIs if the mask bit isn't supported >>> David Vrabel has a device which generates an interrupt storm on >>> the INTx >>> pin if we disable MSI interrupts altogether. Masking interrupts is only >>> a performance optimisation, so we can ignore the request to mask the >>> interrupt. >>> >>> It looks like without the maskbit attribute on MSI, the default way to >>> mask an MSI interrupt was to toggle the MSI enable bit. This was >>> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan >>> was probably 2.6.21 - 2.6.26. >>> >>> >>> >> On the other hand, if the host device supports this maskbit attribute, >> we might want to support it. I'm not sure exactly how though. >> >> If we trap msi entry writes, we're inviting the guest to exit every time >> it wants to disable interrupts. If we don't, we're inviting spurious >> interrupts, which will cause unwanted exits and injections. 
>> > > Avi, I think you are mixing up MSI and MSI-X. MSI does not have any tables: > all changes go through configuration writes, which AFAIK we always trap. > Isn't that right? > Right. > On the other hand, in MSI-X mask bit is mandatory, not optional > so we'll have to support it for assigned devices at some point. > > If we are worried about speed of masking/unmasking MSI-X interrupts for > assigned devices (older kernels used to mask them, recent kernels leave > this to drivers) we will probably need to have MSI-X support in the > kernel, and have kernel examine the mask bit before injecting the > interrupt, just like real devices do. > Yes. Let's actually quantify this though. Several times per second doesn't qualify. >> Maybe we ought to let the guest write to the msi tables without >> trapping, and in the injection logic do something like >> >> msi_entry = *msi_entry_ptr; >> mb(); >> if (msi_entry != msi->last_msi_entry) >> msi_reconfigure(); >> if (msi_enabled(msi_entry)) >> insert_the_needle(); >> > > I don't really understand why do you need the reconfigure > and tracking last msi entry here. > The msi entry can change the vector and delivery mode, no? This way we can cache some stuff instead of decoding it each time. For example, if the entry points at a specific vcpu, we can cache the vcpu pointer, instead of searching for the apic ID.
On Mon, May 18, 2009 at 02:36:41PM +0300, Avi Kivity wrote: > Michael S. Tsirkin wrote: >> On Sun, May 17, 2009 at 11:47:53PM +0300, Avi Kivity wrote: >> >>> Alex Williamson wrote: >>> >>>> On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote: >>>> >>>>> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote: >>>>> >>>>>> On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote: >>>>>> >>>>>>> Very surprising: I haven't seen any driver disable MSI expect on device >>>>>>> destructor path. Is this a linux guest? >>>>>>> >>>>>> Yes, Debian 2.6.26 kernel. I'll check it it behaves the same on newer >>>>>> upstream kernels and try to figure out why it's doing it. >>>>>> >>>>> Updating the guest to 2.6.29 seems to fix the interrupt toggling. So >>>>> it's either something in older kernels or something debian introduced, >>>>> but that seems unlikely. >>>>> >>>> For the curious, this was fixed prior to 2.6.27-rc1 by this: >>>> >>>> commit ce6fce4295ba727b36fdc73040e444bd1aae64cd >>>> Author: Matthew Wilcox >>>> Date: Fri Jul 25 15:42:58 2008 -0600 >>>> >>>> PCI MSI: Don't disable MSIs if the mask bit isn't supported >>>> David Vrabel has a device which generates an interrupt >>>> storm on the INTx >>>> pin if we disable MSI interrupts altogether. Masking interrupts is only >>>> a performance optimisation, so we can ignore the request to mask the >>>> interrupt. >>>> >>>> It looks like without the maskbit attribute on MSI, the default way to >>>> mask an MSI interrupt was to toggle the MSI enable bit. This was >>>> introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so it's lifespan >>>> was probably 2.6.21 - 2.6.26. >>>> >>>> >>> On the other hand, if the host device supports this maskbit >>> attribute, we might want to support it. I'm not sure exactly how >>> though. >>> >>> If we trap msi entry writes, we're inviting the guest to exit every >>> time it wants to disable interrupts. 
If we don't, we're inviting >>> spurious interrupts, which will cause unwanted exits and injections. >>> >> >> Avi, I think you are mixing up MSI and MSI-X. MSI does not have any tables: >> all changes go through configuration writes, which AFAIK we always trap. >> Isn't that right? >> > > Right. > >> On the other hand, in MSI-X mask bit is mandatory, not optional >> so we'll have to support it for assigned devices at some point. >> >> If we are worried about speed of masking/unmasking MSI-X interrupts for >> assigned devices (older kernels used to mask them, recent kernels leave >> this to drivers) we will probably need to have MSI-X support in the >> kernel, and have kernel examine the mask bit before injecting the >> interrupt, just like real devices do. >> > > Yes. Actually, if we do that, we'll need to address a race where a driver has updated the mask bit in the window after we tested it and before we inject the interrupt. Not sure how to do this. > Let's actually quantify this though. Several times per second > doesn't qualify. Oh, I didn't actually imply that we need to optimize this path. You seemed worried about the speed of masking the interrupt, and I commented that to optimize it we'll have to move it to kernel. >>> Maybe we ought to let the guest write to the msi tables without >>> trapping, and in the injection logic do something like >>> >>> msi_entry = *msi_entry_ptr; >>> mb(); >>> if (msi_entry != msi->last_msi_entry) >>> msi_reconfigure(); >>> if (msi_enabled(msi_entry)) >>> insert_the_needle(); >>> >> >> I don't really understand why do you need the reconfigure >> and tracking last msi entry here. >> > > The msi entry can change the vector and delivery mode, no? This way we > can cache some stuff instead of decoding it each time. For example, if > the entry points at a specific vcpu, we can cache the vcpu pointer, > instead of searching for the apic ID.
Michael S. Tsirkin wrote: >> >>> On the other hand, in MSI-X mask bit is mandatory, not optional >>> so we'll have to support it for assigned devices at some point. >>> >>> If we are worried about speed of masking/unmasking MSI-X interrupts for >>> assigned devices (older kernels used to mask them, recent kernels leave >>> this to drivers) we will probably need to have MSI-X support in the >>> kernel, and have kernel examine the mask bit before injecting the >>> interrupt, just like real devices do. >>> >>> >> Yes. >> > > Actually, if we do that, we'll need to address a race where a driver > has updated the mask bit in the window after we tested it > and before we inject the interrupt. Not sure how to do this. > The driver can't tell if the interrupt came first, so it's a valid race (real hardware has the same race).
On Mon, May 18, 2009 at 03:33:20PM +0300, Avi Kivity wrote: > Michael S. Tsirkin wrote: >>> >>>> On the other hand, in MSI-X mask bit is mandatory, not optional >>>> so we'll have to support it for assigned devices at some point. >>>> >>>> If we are worried about speed of masking/unmasking MSI-X interrupts for >>>> assigned devices (older kernels used to mask them, recent kernels leave >>>> this to drivers) we will probably need to have MSI-X support in the >>>> kernel, and have kernel examine the mask bit before injecting the >>>> interrupt, just like real devices do. >>>> >>> Yes. >>> >> >> Actually, if we do that, we'll need to address a race where a driver >> has updated the mask bit in the window after we tested it >> and before we inject the interrupt. Not sure how to do this. >> > > The driver can't tell if the interrupt came first, so it's a valid race > (real hardware has the same race). The driver for a real device can do a read followed by sync_irq to flush out interrupts.
Michael S. Tsirkin wrote: > On Mon, May 18, 2009 at 03:33:20PM +0300, Avi Kivity wrote: > >> Michael S. Tsirkin wrote: >> >>>>> On the other hand, in MSI-X mask bit is mandatory, not optional >>>>> so we'll have to support it for assigned devices at some point. >>>>> >>>>> If we are worried about speed of masking/unmasking MSI-X interrupts for >>>>> assigned devices (older kernels used to mask them, recent kernels leave >>>>> this to drivers) we will probably need to have MSI-X support in the >>>>> kernel, and have kernel examine the mask bit before injecting the >>>>> interrupt, just like real devices do. >>>>> >>>>> >>>> Yes. >>>> >>>> >>> Actually, if we do that, we'll need to address a race where a driver >>> has updated the mask bit in the window after we tested it >>> and before we inject the interrupt. Not sure how to do this. >>> >>> >> The driver can't tell if the interrupt came first, so it's a valid race >> (real hardware has the same race). >> > > The driver for real device can do a read followed by sync_irq to flush > out interrupts. > If it generates the interrupt after masking it in the msi-x entry, we'll see it. If it generates the interrupt before masking it, it may or may not receive the interrupt, even on real hardware.
On Mon, May 18, 2009 at 04:55:30PM +0300, Avi Kivity wrote: > Michael S. Tsirkin wrote: >> On Mon, May 18, 2009 at 03:33:20PM +0300, Avi Kivity wrote: >> >>> Michael S. Tsirkin wrote: >>> >>>>>> On the other hand, in MSI-X mask bit is mandatory, not optional >>>>>> so we'll have to support it for assigned devices at some point. >>>>>> >>>>>> If we are worried about speed of masking/unmasking MSI-X interrupts for >>>>>> assigned devices (older kernels used to mask them, recent kernels leave >>>>>> this to drivers) we will probably need to have MSI-X support in the >>>>>> kernel, and have kernel examine the mask bit before injecting the >>>>>> interrupt, just like real devices do. >>>>>> >>>>> Yes. >>>>> >>>> Actually, if we do that, we'll need to address a race where a driver >>>> has updated the mask bit in the window after we tested it >>>> and before we inject the interrupt. Not sure how to do this. >>>> >>> The driver can't tell if the interrupt came first, so it's a valid >>> race (real hardware has the same race). >>> >> >> The driver for real device can do a read followed by sync_irq to flush >> out interrupts. >> > > If it generates the interrupt after masking it in the msi-x entry, we'll > see it. If it generates the interrupt before masking it, it may or may > not receive the interrupt, even on real hardware. Yes, but in the latter case, real hardware must re-send the pending interrupt after it is unmasked (that's the spec). We would just lose it.
Michael S. Tsirkin wrote:
>> If it generates the interrupt after masking it in the msi-x entry, we'll
>> see it.  If it generates the interrupt before masking it, it may or may
>> not receive the interrupt, even on real hardware.
>
> Yes, but in the latter case real hardware must re-send the pending
> interrupt after it is unmasked (that's what the spec requires).  We would
> just lose it.

That's a different matter.  We need to buffer the interrupt pending bit,
and a way for userspace to either query that buffer or have a
conditional injection (inject_if_pending).
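For illustration only, buffering the pending bit with a conditional injection on unmask could look roughly like the sketch below. The types and names (`msix_vector`, `vector_raise`, `vector_unmask`) are invented for the example; they are not kvm or libkvm code.

```c
#include <stdbool.h>

/* Hypothetical per-vector state: the mask bit as the guest sees it in
 * the MSI-X table, plus a buffered pending bit. */
struct msix_vector {
    bool masked;
    bool pending;
};

/* Device raised the interrupt: deliver immediately if unmasked,
 * otherwise remember it in the pending bit.  Returns true if the
 * interrupt was delivered now. */
bool vector_raise(struct msix_vector *v)
{
    if (v->masked) {
        v->pending = true;
        return false;
    }
    return true;                /* stand-in for the actual injection */
}

/* inject_if_pending: when the guest clears the mask bit, deliver any
 * buffered interrupt, as the MSI-X spec requires of real hardware.
 * Returns true if a pending interrupt was delivered. */
bool vector_unmask(struct msix_vector *v)
{
    v->masked = false;
    if (v->pending) {
        v->pending = false;
        return true;            /* stand-in for the actual injection */
    }
    return false;
}
```

With this shape, an interrupt arriving while masked is never lost; it is simply re-delivered when the guest unmasks the vector.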
On Mon, May 18, 2009 at 05:46:09PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>>> If it generates the interrupt after masking it in the msi-x entry,
>>> we'll see it.  If it generates the interrupt before masking it, it
>>> may or may not receive the interrupt, even on real hardware.
>>
>> Yes, but in the latter case real hardware must re-send the pending
>> interrupt after it is unmasked (that's what the spec requires).  We
>> would just lose it.
>
> That's a different matter.  We need to buffer the interrupt pending
> bit, and a way for userspace to either query that buffer or have a
> conditional injection (inject_if_pending).

Here's the race as I see it: we discussed the possibility of making the
kernel and userspace share an actual memory page, and using that for
MSI-X tables.

	host kernel wants to send an msi-x message
	host kernel tests mask bit: unmasked
	guest sets mask bit
	guest does a read to flush msi writes
	guest does sync_irq and makes sure there are no
	outstanding interrupts
	---> at this stage the guest expects not to get interrupts
	guest starts editing the msix entry

	host kernel never saw the mask, so it sends the message to the
	old address, or even a corrupted address which the guest is in
	the middle of editing
	bad things happen

This race is not easy to solve, except by catching writes to the msix
table and synchronising them with interrupt delivery.
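One way to realise the synchronisation described above is to make guest writes to an MSI-X table entry and host injection take the same lock, so the mask test and the address read are atomic with respect to table edits. A minimal sketch with invented names (`vec`, `vec_table_write`, `vec_inject`), not actual kvm code:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vector state guarded by a lock. */
struct vec {
    pthread_mutex_t lock;
    bool masked;
    bool pending;
    uint64_t addr;              /* MSI-X message address */
};

/* Guest write to the MSI-X table entry: serialized against injection,
 * so the host can never send to a half-edited address. */
void vec_table_write(struct vec *v, bool masked, uint64_t addr)
{
    pthread_mutex_lock(&v->lock);
    v->masked = masked;
    v->addr = addr;
    pthread_mutex_unlock(&v->lock);
}

/* Host injection path: the mask test and address read happen under the
 * same lock, so they are atomic with respect to table writes.  Returns
 * the address delivered to, or 0 if the vector was masked. */
uint64_t vec_inject(struct vec *v)
{
    uint64_t delivered = 0;

    pthread_mutex_lock(&v->lock);
    if (v->masked)
        v->pending = true;      /* buffer for later re-delivery */
    else
        delivered = v->addr;    /* stand-in for the actual MSI write */
    pthread_mutex_unlock(&v->lock);
    return delivered;
}
```

Because the mask check and the address read sit inside the critical section, the "message sent to a half-edited address" window in the timeline above cannot occur: an injection either completes before the guest's table write takes the lock, or it observes the mask bit that write set.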
Michael S. Tsirkin wrote:
> Here's the race as I see it: we discussed the possibility of making
> the kernel and userspace share an actual memory page, and using that
> for MSI-X tables.
>
>	host kernel wants to send an msi-x message
>	host kernel tests mask bit: unmasked
>	guest sets mask bit
>	guest does a read to flush msi writes
>	guest does sync_irq and makes sure there are no
>	outstanding interrupts
>	---> at this stage the guest expects not to get interrupts
>	guest starts editing the msix entry
>
>	host kernel never saw the mask, so it sends the message to the
>	old address, or even a corrupted address which the guest is in
>	the middle of editing
>	bad things happen
>
> This race is not easy to solve, except by catching writes to the msix
> table and synchronising them with interrupt delivery.

You're right, of course.  In any case this is premature; we'll have to
see if this happens with any frequency.
diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a7365c8..a6cc9b9 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
 {
     int i;
 
-    for (i = 0; i < dev->irq_entries_nr; i++)
+    for (i = 0; i < dev->irq_entries_nr; i++) {
         kvm_del_routing_entry(kvm_context, &dev->entry[i]);
+        kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
+    }
     free(dev->entry);
     dev->entry = NULL;
     dev->irq_entries_nr = 0;
diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
index 591fb53..4b3cb51 100644
--- a/kvm/libkvm/kvm-common.h
+++ b/kvm/libkvm/kvm-common.h
@@ -66,8 +66,10 @@ struct kvm_context {
 #ifdef KVM_CAP_IRQ_ROUTING
 	struct kvm_irq_routing *irq_routes;
 	int nr_allocated_irq_routes;
+	void *used_gsi_bitmap;
+	int max_gsi;
+	pthread_mutex_t gsi_mutex;
 #endif
-	int max_used_gsi;
 };
 
 int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index ba0a5d1..3eaa120 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -35,6 +35,7 @@
 #include <errno.h>
 #include <sys/ioctl.h>
 #include <inttypes.h>
+#include <pthread.h>
 #include "libkvm.h"
 
 #if defined(__x86_64__) || defined(__i386__)
@@ -65,6 +66,8 @@ int kvm_abi = EXPECTED_KVM_API_VERSION;
 int kvm_page_size;
 
+static inline void set_bit(uint32_t *buf, unsigned int bit);
+
 struct slot_info {
 	unsigned long phys_addr;
 	unsigned long len;
@@ -286,6 +289,9 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 	int fd;
 	kvm_context_t kvm;
 	int r;
+#ifdef KVM_CAP_IRQ_ROUTING
+	int gsi_count, gsi_bytes, i;
+#endif
 
 	fd = open("/dev/kvm", O_RDWR);
 	if (fd == -1) {
@@ -322,6 +328,25 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 	kvm->dirty_pages_log_all = 0;
 	kvm->no_irqchip_creation = 0;
 	kvm->no_pit_creation = 0;
+#ifdef KVM_CAP_IRQ_ROUTING
+	pthread_mutex_init(&kvm->gsi_mutex, NULL);
+
+	gsi_count = kvm_get_gsi_count(kvm);
+	/* Round up so we can search ints using ffs */
+	gsi_bytes = ((gsi_count + 31) / 32) * 4;
+	kvm->used_gsi_bitmap = malloc(gsi_bytes);
+	if (!kvm->used_gsi_bitmap)
+		goto out_close;
+	memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
+	kvm->max_gsi = gsi_bytes * 8;
+
+	/* Mark all the IOAPIC pin GSIs and any over-allocated
+	 * GSIs as already in use. */
+	for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
+		set_bit(kvm->used_gsi_bitmap, i);
+	for (i = gsi_count; i < kvm->max_gsi; i++)
+		set_bit(kvm->used_gsi_bitmap, i);
+#endif
 
 	return kvm;
  out_close:
@@ -1298,8 +1323,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
 	new->flags = entry->flags;
 	new->u = entry->u;
 
-	if (entry->gsi > kvm->max_used_gsi)
-		kvm->max_used_gsi = entry->gsi;
 	return 0;
 #else
 	return -ENOSYS;
@@ -1404,18 +1427,54 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
 #endif
 }
 
+#ifdef KVM_CAP_IRQ_ROUTING
+static inline void set_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] |= 1U << (bit % 32);
+}
+
+static inline void clear_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] &= ~(1U << (bit % 32));
+}
+
+static int kvm_find_free_gsi(kvm_context_t kvm)
+{
+	int i, bit, gsi;
+	uint32_t *buf = kvm->used_gsi_bitmap;
+
+	for (i = 0; i < kvm->max_gsi / 32; i++) {
+		bit = ffs(~buf[i]);
+		if (!bit)
+			continue;
+
+		gsi = bit - 1 + i * 32;
+		set_bit(buf, gsi);
+		return gsi;
+	}
+
+	return -ENOSPC;
+}
+#endif
+
 int kvm_get_irq_route_gsi(kvm_context_t kvm)
 {
+	int gsi = -ENOSYS;
+
 #ifdef KVM_CAP_IRQ_ROUTING
-	if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS) {
-		if (kvm->max_used_gsi <= kvm_get_gsi_count(kvm))
-			return kvm->max_used_gsi + 1;
-		else
-			return -ENOSPC;
-	} else
-		return KVM_IOAPIC_NUM_PINS;
-#else
-	return -ENOSYS;
+	pthread_mutex_lock(&kvm->gsi_mutex);
+	gsi = kvm_find_free_gsi(kvm);
+	pthread_mutex_unlock(&kvm->gsi_mutex);
+#endif
+	return gsi;
+}
+
+void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi)
+{
+#ifdef KVM_CAP_IRQ_ROUTING
+	pthread_mutex_lock(&kvm->gsi_mutex);
+	clear_bit(kvm->used_gsi_bitmap, gsi);
+	pthread_mutex_unlock(&kvm->gsi_mutex);
+#endif
+}
diff --git a/kvm/libkvm/libkvm.h b/kvm/libkvm/libkvm.h
index 4821a1e..845bb8a 100644
--- a/kvm/libkvm/libkvm.h
+++ b/kvm/libkvm/libkvm.h
@@ -856,6 +856,16 @@ int kvm_commit_irq_routes(kvm_context_t kvm);
  */
 int kvm_get_irq_route_gsi(kvm_context_t kvm);
 
+/*!
+ * \brief Free used GSI number
+ *
+ * Free used GSI number acquired from kvm_get_irq_route_gsi()
+ *
+ * \param kvm Pointer to the current kvm_context
+ * \param gsi GSI number to free
+ */
+void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi);
+
 #ifdef KVM_CAP_DEVICE_MSIX
 int kvm_assign_set_msix_nr(kvm_context_t kvm,
 			   struct kvm_assigned_msix_nr *msix_nr);
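The heart of the patch is the ffs-based bitmap search. As an illustration, here is the same allocator reworked as a standalone snippet (helper names are prefixed with `gsi_` to avoid clashing with the patch's `set_bit`/`clear_bit`; this is a sketch, not the patched libkvm code):

```c
#include <errno.h>
#include <stdint.h>
#include <strings.h>    /* ffs() */

/* Set or clear one bit in a used-GSI bitmap of 32-bit words. */
static inline void gsi_set_bit(uint32_t *buf, unsigned int bit)
{
    buf[bit / 32] |= 1U << (bit % 32);
}

static inline void gsi_clear_bit(uint32_t *buf, unsigned int bit)
{
    buf[bit / 32] &= ~(1U << (bit % 32));
}

/* Find the lowest clear bit, mark it used, and return its index.
 * max_gsi must be a multiple of 32, as guaranteed by rounding the
 * bitmap size up to whole words. */
int find_free_gsi(uint32_t *buf, int max_gsi)
{
    int i, bit, gsi;

    for (i = 0; i < max_gsi / 32; i++) {
        bit = ffs(~buf[i]);     /* 1-based index of lowest clear bit */
        if (!bit)
            continue;           /* this word is fully used */

        gsi = bit - 1 + i * 32;
        gsi_set_bit(buf, gsi);
        return gsi;
    }

    return -ENOSPC;             /* every GSI is in use */
}
```

Because freed GSIs are simply cleared in the bitmap, the allocator reuses them on the next search instead of marching monotonically toward KVM_MAX_IRQ_ROUTES, which is exactly what the old counter-based scheme failed to do.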
We're currently using a counter to track the most recent GSI we've
handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
assignment with a driver that regularly toggles the MSI enable bit.
This can mean only a few minutes of usable run time.  Instead, track
used GSIs in a bitmap.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
---

v2: Added mutex to protect gsi bitmap
v3: Updated for comments from Michael Tsirkin
    No longer depends on "[PATCH] kvm: device-assignment: Catch GSI overflow"
v4: Fix gsi_bytes calculation noted by Sheng Yang

 hw/device-assignment.c  |    4 ++
 kvm/libkvm/kvm-common.h |    4 ++
 kvm/libkvm/libkvm.c     |   81 +++++++++++++++++++++++++++++++++++++++++------
 kvm/libkvm/libkvm.h     |   10 ++++++
 4 files changed, 86 insertions(+), 13 deletions(-)