Message ID: 20190131185656.17972-1-logang@deltatee.com
Series: Support using MSI interrupts in ntb_transport
On 1/31/2019 11:56 AM, Logan Gunthorpe wrote:
> Hi,
>
> This patch series adds optional support for using MSI interrupts instead
> of NTB doorbells in ntb_transport. This is desirable since doorbells on
> current hardware are quite slow, so switching to MSI interrupts provides
> a significant performance gain. On switchtec hardware, a simple
> apples-to-apples comparison shows ntb_netdev/iperf numbers going from
> 3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.
>
> To do this, a couple of changes are required outside of the NTB tree:
>
> 1) The IOMMU must know to accept MSI requests from aliased bus numbers,
> since NTB hardware typically sends proxied requests with additional
> requester IDs. The first patch in this series adds support for the
> Intel IOMMU. A quirk to add these aliases for switchtec hardware was
> already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
> for Microsemi Switchtec NTB") for a description of NTB proxy IDs and
> why this is necessary.
>
> 2) NTB transport (and other clients) may often need more MSI interrupts
> than the NTB hardware actually advertises support for. However, since
> these interrupts will not be triggered by the hardware but through an
> NTB memory window, the hardware does not actually need to support or
> know about them. Therefore we add the concept of virtual MSI
> interrupts, which are allocated just like any other MSI interrupt but
> are not programmed into the hardware's MSI table. This is done in
> Patch 2 and then made use of in Patch 3.

Logan,

Does this work when the system moves the MSI vector either via software
(irqbalance) or BIOS APIC programming (some modes cause round robin
behavior)?

> The remaining patches in this series add a library for dealing with MSI
> interrupts, a test client and finally support in ntb_transport.
>
> The series is based off of v5.0-rc4 and I've tested it on top of the
> patches I've already sent to the NTB tree (though they are independent
> changes). A git repo is available here:
>
> https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1
>
> Thanks,
>
> Logan
>
> --
>
> Logan Gunthorpe (9):
>   iommu/vt-d: Allow interrupts from the entire bus for aliased devices
>   PCI/MSI: Support allocating virtual MSI interrupts
>   PCI/switchtec: Add module parameter to request more interrupts
>   NTB: Introduce functions to calculate multi-port resource index
>   NTB: Rename ntb.c to support multiple source files in the module
>   NTB: Introduce MSI library
>   NTB: Introduce NTB MSI Test Client
>   NTB: Add ntb_msi_test support to ntb_test
>   NTB: Add MSI interrupt support to ntb_transport
>
>  drivers/iommu/intel_irq_remapping.c     |  12 +
>  drivers/ntb/Kconfig                     |  10 +
>  drivers/ntb/Makefile                    |   3 +
>  drivers/ntb/{ntb.c => core.c}           |   0
>  drivers/ntb/msi.c                       | 313 ++++++++++++++++++
>  drivers/ntb/ntb_transport.c             | 134 +++++++-
>  drivers/ntb/test/Kconfig                |   9 +
>  drivers/ntb/test/Makefile               |   1 +
>  drivers/ntb/test/ntb_msi_test.c         | 416 ++++++++++++++++++++++++
>  drivers/pci/msi.c                       |  51 ++-
>  drivers/pci/switch/switchtec.c          |  12 +-
>  include/linux/msi.h                     |   1 +
>  include/linux/ntb.h                     | 139 ++++++++
>  include/linux/pci.h                     |   9 +
>  tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
>  15 files changed, 1150 insertions(+), 14 deletions(-)
>  rename drivers/ntb/{ntb.c => core.c} (100%)
>  create mode 100644 drivers/ntb/msi.c
>  create mode 100644 drivers/ntb/test/ntb_msi_test.c
>
> --
> 2.19.0
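To make the "virtual MSI interrupt" idea in point 2 above concrete, here
is a minimal sketch of how a client might allocate more vectors than the
NTB hardware exposes. It assumes the PCI_IRQ_VIRTUAL flag proposed by the
"PCI/MSI: Support allocating virtual MSI interrupts" patch in this
series; the demo_* names are purely illustrative and not part of the
series.

#include <linux/pci.h>
#include <linux/interrupt.h>

static irqreturn_t demo_isr(int irq, void *dev_id)
{
	/* Handle the doorbell-replacement interrupt for one queue. */
	return IRQ_HANDLED;
}

static int demo_setup_msi(struct pci_dev *pdev, unsigned int want)
{
	int nvec, i, ret;

	/*
	 * Ask for up to 'want' vectors; with PCI_IRQ_VIRTUAL, vectors
	 * beyond what the device's MSI-X table supports are still
	 * allocated but never programmed into the hardware.
	 */
	nvec = pci_alloc_irq_vectors(pdev, 1, want,
				     PCI_IRQ_MSIX | PCI_IRQ_VIRTUAL);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++) {
		ret = request_irq(pci_irq_vector(pdev, i), demo_isr, 0,
				  "ntb_msi_demo", pdev);
		if (ret)
			goto err_free;
	}

	return nvec;

err_free:
	while (--i >= 0)
		free_irq(pci_irq_vector(pdev, i), pdev);
	pci_free_irq_vectors(pdev);
	return ret;
}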
On 2019-01-31 1:20 p.m., Dave Jiang wrote:
> Does this work when the system moves the MSI vector either via software
> (irqbalance) or BIOS APIC programming (some modes cause round robin
> behavior)?

I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming; however, I would expect these
things not to be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts: it allocates and uses them just as any PCI driver would. The
only real difference is that instead of a piece of hardware sending the
IRQ TLP, it will be sent through the memory window (which, from the OS's
perspective, is just coming from an NTB hardware proxy alias).

Logan
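As a hypothetical sketch of what "sending the IRQ TLP through the memory
window" amounts to on the peer side, assuming the MSI address offset and
data for a given vector have already been exchanged out of band (for
example via scratchpads); the demo_* names are illustrative and not the
series' actual API:

#include <linux/io.h>
#include <linux/types.h>

struct demo_peer_msi {
	void __iomem *mw_base;	/* peer mapping of the memory window */
	resource_size_t offset;	/* MSI target address offset in the window */
	u32 data;		/* MSI message data for this vector */
};

/*
 * Writing the 32-bit message data at the right offset produces a memory
 * write TLP that, from the receiving OS's perspective, is just an MSI
 * arriving from the NTB proxy alias.
 */
static void demo_peer_trigger_irq(struct demo_peer_msi *pmsi)
{
	iowrite32(pmsi->data, pmsi->mw_base + pmsi->offset);
}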
On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>> Does this work when the system moves the MSI vector either via software
>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>> behavior)?
>
> I don't know how irqbalance works, and I'm not sure what you are
> referring to by BIOS APIC programming; however, I would expect these
> things not to be a problem.
>
> The MSI code I'm presenting here doesn't do anything crazy with the
> interrupts: it allocates and uses them just as any PCI driver would. The
> only real difference is that instead of a piece of hardware sending the
> IRQ TLP, it will be sent through the memory window (which, from the OS's
> perspective, is just coming from an NTB hardware proxy alias).
>
> Logan

Right. I did that as a hack a while back as a workaround for some silicon
errata. When the vector moves, the address for the LAPIC changes. So
unless it gets updated, you end up writing to the old location and lose
all the new interrupts. irqbalance is a user daemon that rotates the
system interrupts around to ensure that not all interrupts are pinned on
a single core. I think it's enabled by default on several distros.

Although MSI-X has nothing to do with the IOAPIC, the mode the APIC is
programmed in can influence how the interrupts are delivered. Certain
Intel platforms (I don't know if AMD does anything like that) put the
IOAPIC in a configuration that causes the interrupts to be moved in a
round-robin fashion. I think it's physical flat mode? I don't quite
recall. Normally on the low-end Xeons. It's probably worth doing a test
run with the irqbalance daemon running and making sure your traffic
stream doesn't all of a sudden stop.
On 2019-01-31 1:58 p.m., Dave Jiang wrote:
>
> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>> Does this work when the system moves the MSI vector either via software
>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>> behavior)?
>>
>> I don't know how irqbalance works, and I'm not sure what you are
>> referring to by BIOS APIC programming; however, I would expect these
>> things not to be a problem.
>>
>> The MSI code I'm presenting here doesn't do anything crazy with the
>> interrupts: it allocates and uses them just as any PCI driver would. The
>> only real difference is that instead of a piece of hardware sending the
>> IRQ TLP, it will be sent through the memory window (which, from the OS's
>> perspective, is just coming from an NTB hardware proxy alias).
>>
>> Logan
>
> Right. I did that as a hack a while back as a workaround for some silicon
> errata. When the vector moves, the address for the LAPIC changes. So
> unless it gets updated, you end up writing to the old location and lose
> all the new interrupts. irqbalance is a user daemon that rotates the
> system interrupts around to ensure that not all interrupts are pinned on
> a single core.

Yes, that would be a problem if something changes the MSI vectors out
from under us. It seems like that would be a bit difficult to do even
with regular hardware, and so far I haven't seen anything that would do
that. If you know where in the kernel this happens I'd be interested in
getting a pointer to the flow in the code. If that is the case, this MSI
stuff will need to get much more complicated...

> I think it's enabled by default on several distros.
>
> Although MSI-X has nothing to do with the IOAPIC, the mode the APIC is
> programmed in can influence how the interrupts are delivered. Certain
> Intel platforms (I don't know if AMD does anything like that) put the
> IOAPIC in a configuration that causes the interrupts to be moved in a
> round-robin fashion. I think it's physical flat mode? I don't quite
> recall. Normally on the low-end Xeons. It's probably worth doing a test
> run with the irqbalance daemon running and making sure your traffic
> stream doesn't all of a sudden stop.

I've tested with irqbalance running and haven't found any noticeable
difference.

Logan
On 1/31/2019 3:39 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 1:58 p.m., Dave Jiang wrote:
>> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>>> Does this work when the system moves the MSI vector either via software
>>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>>> behavior)?
>>>
>>> I don't know how irqbalance works, and I'm not sure what you are
>>> referring to by BIOS APIC programming; however, I would expect these
>>> things not to be a problem.
>>>
>>> The MSI code I'm presenting here doesn't do anything crazy with the
>>> interrupts: it allocates and uses them just as any PCI driver would. The
>>> only real difference is that instead of a piece of hardware sending the
>>> IRQ TLP, it will be sent through the memory window (which, from the OS's
>>> perspective, is just coming from an NTB hardware proxy alias).
>>>
>>> Logan
>>
>> Right. I did that as a hack a while back as a workaround for some silicon
>> errata. When the vector moves, the address for the LAPIC changes. So
>> unless it gets updated, you end up writing to the old location and lose
>> all the new interrupts. irqbalance is a user daemon that rotates the
>> system interrupts around to ensure that not all interrupts are pinned on
>> a single core.
>
> Yes, that would be a problem if something changes the MSI vectors out
> from under us. It seems like that would be a bit difficult to do even
> with regular hardware, and so far I haven't seen anything that would do
> that. If you know where in the kernel this happens I'd be interested in
> getting a pointer to the flow in the code. If that is the case, this MSI
> stuff will need to get much more complicated...

I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
maybe take a look at the code that starts from there and see if it would
have any impact on your stuff.

>> I think it's enabled by default on several distros.
>>
>> Although MSI-X has nothing to do with the IOAPIC, the mode the APIC is
>> programmed in can influence how the interrupts are delivered. Certain
>> Intel platforms (I don't know if AMD does anything like that) put the
>> IOAPIC in a configuration that causes the interrupts to be moved in a
>> round-robin fashion. I think it's physical flat mode? I don't quite
>> recall. Normally on the low-end Xeons. It's probably worth doing a test
>> run with the irqbalance daemon running and making sure your traffic
>> stream doesn't all of a sudden stop.
>
> I've tested with irqbalance running and haven't found any noticeable
> difference.
>
> Logan
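For reference, what irqbalance does boils down to a write like the one
below. This is a small userspace harness (not part of the series) that
can be used to move an IRQ's affinity by hand while traffic is running;
the default IRQ number and CPU mask are placeholders.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	const char *irq = argc > 1 ? argv[1] : "123";	/* placeholder IRQ */
	const char *mask = argc > 2 ? argv[2] : "2";	/* hex mask: CPU1 */
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", irq);

	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return EXIT_FAILURE;
	}

	fprintf(f, "%s\n", mask);
	if (fclose(f)) {	/* the write is flushed and checked here */
		perror(path);
		return EXIT_FAILURE;
	}

	return EXIT_SUCCESS;
}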
On 2019-01-31 3:46 p.m., Dave Jiang wrote:
> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
> maybe take a look at the code that starts from there and see if it would
> have any impact on your stuff.

Ok, well on my system I can write to the smp_affinity all day and the
MSI interrupts still work fine.

The MSI code is a bit difficult to trace and audit with all the
different chips and the parent chips, which I don't have a good
understanding of. But I can definitely see that it could be possible for
some chips to change the address, as a write to smp_affinity will
eventually sometimes call msi_domain_set_affinity(), which does seem to
recompose the message and write it back to the chip.

So, I could relatively easily add a callback to msi_desc to catch this
and resend the MSI address/data. However, I'm not sure how this is ever
done atomically. It seems like there would be a race while the device
updates its address where old interrupts could be triggered. This race
would be much longer for us when sending this information over the NTB
link. Though, I guess if the only change is that it encodes CPU
information in the address, then that would not be an issue. However,
I'm not sure I can say that for certain without a comprehensive
understanding of all the IRQ chips.

Any thoughts on this?

Logan
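For illustration, a sketch of the kind of callback being proposed here,
assuming a hypothetical hook on struct msi_desc that runs after the IRQ
core rewrites the message (e.g. from msi_domain_set_affinity()). The hook
signature, the demo_msi_ctx structure and its peer_notify method are all
illustrative, not from the posted series.

#include <linux/msi.h>
#include <linux/types.h>

struct demo_msi_ctx {
	/*
	 * However the driver republishes the message to the peer, e.g.
	 * NTB scratchpads or a shared region of a memory window.
	 */
	void (*peer_notify)(struct demo_msi_ctx *ctx,
			    u32 addr_lo, u32 addr_hi, u32 data);
};

/* Assumed to be called after the IRQ core writes a new message to 'desc'. */
static void demo_msi_write_msg_cb(struct msi_desc *desc, void *data)
{
	struct demo_msi_ctx *ctx = data;

	/*
	 * desc->msg now holds the recomposed address/data; push it across
	 * the NTB link so the peer starts writing to the new location.
	 * This is where the race discussed above lives: interrupts the
	 * peer sends between the rewrite and this update may still land
	 * on the old address.
	 */
	ctx->peer_notify(ctx, desc->msg.address_lo, desc->msg.address_hi,
			 desc->msg.data);
}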
On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>> maybe take a look at the code that starts from there and see if it would
>> have any impact on your stuff.
>
> Ok, well on my system I can write to the smp_affinity all day and the
> MSI interrupts still work fine.

Maybe your code is ok then. If the stats show up in /proc/interrupts,
you can see it moving to different cores.

> The MSI code is a bit difficult to trace and audit with all the
> different chips and the parent chips, which I don't have a good
> understanding of. But I can definitely see that it could be possible for
> some chips to change the address, as a write to smp_affinity will
> eventually sometimes call msi_domain_set_affinity(), which does seem to
> recompose the message and write it back to the chip.
>
> So, I could relatively easily add a callback to msi_desc to catch this
> and resend the MSI address/data. However, I'm not sure how this is ever
> done atomically. It seems like there would be a race while the device
> updates its address where old interrupts could be triggered. This race
> would be much longer for us when sending this information over the NTB
> link. Though, I guess if the only change is that it encodes CPU
> information in the address, then that would not be an issue. However,
> I'm not sure I can say that for certain without a comprehensive
> understanding of all the IRQ chips.
>
> Any thoughts on this?

Yeah, I'm not sure what to do about it either, as I'm not super familiar
with that area; I'm just making note of what I encountered. And you are
right, the updated info has to go over NTB for the other side to write
to the updated place, so there's a lot of latency involved.

> Logan
On 2019-01-31 4:48 p.m., Dave Jiang wrote:
>
> On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>>> maybe take a look at the code that starts from there and see if it would
>>> have any impact on your stuff.
>>
>> Ok, well on my system I can write to the smp_affinity all day and the
>> MSI interrupts still work fine.
>
> Maybe your code is ok then. If the stats show up in /proc/interrupts,
> you can see it moving to different cores.

Yes, I did check that the stats change CPUs in /proc/interrupts.

> Yeah, I'm not sure what to do about it either, as I'm not super familiar
> with that area; I'm just making note of what I encountered. And you are
> right, the updated info has to go over NTB for the other side to write
> to the updated place, so there's a lot of latency involved.

Ok, well I'll implement the callback anyway for v2. Better safe than
sorry. We can operate on the assumption that someone thought of the race
condition, and if we ever see reports of lost interrupts we'll know where
to look.

Logan