Message ID: 20200821002949.049867339@linutronix.de (mailing list archive)
State: Superseded
Series: x86, PCI, XEN, genirq ...: Prepare for device MSI
On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> +static void ims_mask_irq(struct irq_data *data)
> +{
> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> +	u32 __iomem *ctrl = &slot->ctrl;
> +
> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);

Just to be clear, this is exactly the sort of operation we can't do
with non-MSI interrupts. For a real PCI device to execute this it
would have to keep the data on die.

I saw the idxd driver was doing something like this, I assume it
avoids trouble because it is a fake PCI device integrated with the
CPU, not on a real PCI bus?

It is really nice to see irq_domain used properly in x86!

Thanks,
Jason
On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
>> +static void ims_mask_irq(struct irq_data *data)
>> +{
>> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
>> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
>> +	u32 __iomem *ctrl = &slot->ctrl;
>> +
>> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
>
> Just to be clear, this is exactly the sort of operation we can't do
> with non-MSI interrupts. For a real PCI device to execute this it
> would have to keep the data on die.

"We" means NVIDIA and your new device, right?

So if I understand correctly then the queue memory where the MSI
descriptor sits is in RAM. How is that supposed to work if interrupt
remapping is disabled? That means irq migration and proper disabling
of an interrupt become an interesting exercise. I'm so not looking
forward to that.

If interrupt remapping is enabled then both are trivial because then
the irq chip can delegate everything to the parent chip, i.e. the
remapping unit.

Can you please explain that a bit more precisely?

> I saw the idxd driver was doing something like this, I assume it
> avoids trouble because it is a fake PCI device integrated with the
> CPU, not on a real PCI bus?

That's how it is implemented as far as I understood the patches. It's
device memory, therefore iowrite32().

> It is really nice to see irq_domain used properly in x86!

If you ignore the abuse in XEN :)

And to be fair, proper and usable (hierarchical) irq domains originate
from x86 and happened to solve quite a few horrorshows on the ARM
side. Just back then when we converted the original maze, nobody had a
good idea and the stomach to touch XEN.

Thanks,
tglx
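[For the remapping-enabled case described above, delegation to the
parent chip is mechanically simple. A minimal sketch using the stock
parent-forwarding helpers from kernel/irq/chip.c -- illustrative only,
not part of the posted series:]

#include <linux/irq.h>
#include <linux/msi.h>

/*
 * Sketch: when the MSI message is programmed once at startup and
 * interrupt remapping handles masking and affinity, the device level
 * chip only forwards everything to the parent (remapping) irq chip.
 */
static const struct irq_chip ims_remap_msi_controller = {
	.name			= "IMS-REMAP",
	.irq_mask		= irq_chip_mask_parent,
	.irq_unmask		= irq_chip_unmask_parent,
	.irq_ack		= irq_chip_ack_parent,
	.irq_set_affinity	= irq_chip_set_affinity_parent,
	.irq_retrigger		= irq_chip_retrigger_hierarchy,
	.flags			= IRQCHIP_SKIP_SET_WAKE,
};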
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +	u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
>
> "We" means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, e.g. the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update is a
command/response operation not a MMIO operation.

The HW is already swapping the queues causing the interrupts to RAM,
so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense, a 'working set' for the NIC device in some cases
can be hundreds of megabytes of data. System RAM is used to store
this, and precious on-die memory holds some dynamic active set, much
like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.

> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
>
> That's how it is implemented as far as I understood the patches. It's
> device memory, therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need
I assumed the idxd HW had some hidden swapping to RAM.

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO triggered swapping work that are not
available to a true PCI-E device.

Jason
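[A sketch of what the command/response update flow described above
could look like. All names here -- struct ims_dev, struct ims_cmd,
ims_issue_cmd(), CMD_SET_MSI -- are hypothetical, not a real mlx5 or
device interface:]

#include <linux/jiffies.h>
#include <linux/msi.h>
#include <linux/workqueue.h>

/* Hypothetical command layout and device handle, for illustration */
struct ims_cmd {
	u32	opcode;
	u32	slot;
	u32	address_lo;
	u32	address_hi;
	u32	data;
};

struct ims_dev {
	struct work_struct reset_work;	/* full device reset path */
	/* command queue internals elided */
};

#define CMD_SET_MSI	1
#define IMS_CMD_TIMEOUT	msecs_to_jiffies(100)

/* Hypothetical: queue a command, ring the doorbell, wait for response */
int ims_issue_cmd(struct ims_dev *idev, struct ims_cmd *cmd,
		  unsigned long timeout);

static int ims_update_msi_slot(struct ims_dev *idev, u32 slot,
			       const struct msi_msg *msg)
{
	struct ims_cmd cmd = {
		.opcode		= CMD_SET_MSI,
		.slot		= slot,
		.address_lo	= msg->address_lo,
		.address_hi	= msg->address_hi,
		.data		= msg->data,
	};
	int ret;

	/* Push the command and let the device DMA and serialize it */
	ret = ims_issue_cmd(idev, &cmd, IMS_CMD_TIMEOUT);
	if (ret == -ETIMEDOUT) {
		/* No response: assume the device died, reset from a WQ */
		schedule_work(&idev->reset_work);
	}
	return ret;
}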
On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
>> So if I understand correctly then the queue memory where the MSI
>> descriptor sits is in RAM.
>
> Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> could have enough on-die memory then they could just use really big
> MSI-X tables. Currently due to on-die memory constraints mlx5 is
> limited to a few hundred MSI-X vectors.

Right, that's the limit of a particular device, but nothing prevents
you from having a larger table on a new device. The MSI-X limitation
to 2048 vectors is defined by the PCI spec and you'd need either some
non-spec-compliant abuse of the reserved size bits or some extra
config entry. So IMS is a way to work around that. I understand why
you want to move the entries to main memory, but you have to deal
with the problems this creates upfront.

> Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
> RAM.
>
> Moving away from MSI-X's MMIO access model allows them to be swapped
> to RAM. The cost is that accessing them for update is a
> command/response operation not a MMIO operation.
>
> The HW is already swapping the queues causing the interrupts to RAM,
> so adding a bit of additional data to store the MSI addr/data is
> reasonable.

Makes sense.

>> How is that supposed to work if interrupt remapping is disabled?
>
> The best we can do is issue a command to the device and spin/sleep
> until completion. The device will serialize everything internally.
>
> If the device has died the driver has code to detect and trigger a
> PCI function reset which will definitely stop the interrupt.

If that interrupt has gone into storm mode for some reason then this
will render your machine unusable before you can do that.

> So, the implementation of these functions would be to push any change
> onto a command queue, trigger the device to DMA the command, spin/sleep
> until the device returns a response and then continue on. If the
> device doesn't return a response in a time window then trigger a WQ to
> do a full device reset.

I really don't want to do that with the irq descriptor lock held, or
in the affinity case from the interrupt handler, as we have to do
with PCI MSI/MSI-X due to the horrors of the X86 interrupt delivery
trainwreck. Also you cannot call into command queue code from
interrupt disabled and interrupt descriptor lock held sections. You
can try, but lockdep will yell at you immediately.

There is also CPU hotplug where we have to force migrate an interrupt
away from an outgoing CPU. This needs some serious thought.

One question is whether the device can see partial updates to that
memory due to the async 'swap' of context from the device CPU.

So we have address_lo, address_hi, data and ctrl, each of them 32
bit. address_hi is only relevant when the number of CPUs is > 255,
which requires X2APIC, which in turn requires interrupt remapping.
For all others the address_hi value never changes. Let's ignore that
case for now, but see further down.

So what's interesting is address_lo and data. If the device sees a
partial update, i.e. address_lo is written and the device grabs the
update before data is written, then the next interrupt will end up in
lala land. We have code for that in place in msi_set_affinity() in
arch/x86/kernel/apic/msi.c. Get eyecancer protection glasses before
opening that and keep beer ready to wipe out the horrors immediately
afterwards.
If the device updates the data only when a command is issued then this
is not a problem, but it causes other problems because you still
cannot access the command queue from that context. This makes it even
worse for the CPU hotplug case. More on that below.

If it takes whatever it sees while grabbing the state when switching
to a different queue or at the point of actual interrupt delivery,
then you have a problem. Not only you, I'm going to have one as well
because I'm going to be the poor sod who has to come up with the
workaround. So we better address that _before_ you start deploying
this piece of art. I'm not really interested in another slightly
different and probably more horrible version of the same story. Don't
blame me, it's the way Intel decided to make this "work".

There are a couple of options to ensure that the device will never see
inconsistent state:

 1) Use a locked 16 byte wide operation (cmpxchg16b) which is not
    available on 32bit.

 2) Order the MSG entry differently in the queue storage:

	u32	address_lo
	u32	data
	u32	address_hi
	u32	ctrl

    and then enforce an 8 byte store on 64 bit, which is guaranteed
    to be atomic vs. other CPUs and bus agents, i.e. DMA. I said
    enforce because compilers are known to do stupid things.

Both are fine for me and the only caveat is that the access does not
go across a cache line boundary. The restriction to 64bit shouldn't
be a problem either. Running such a device on 32bit causes more
problems than it solves :)

> The spin/sleep is only needed if the update has to be synchronous, so
> things like rebalancing could just push the rebalancing work and
> immediately return.

Interrupt migration is async anyway. An interrupt might have been
sent to the old vector just before the new vector was written. That's
already dealt with. The old vector is cleaned up when the first
interrupt arrives on the new vector, which is the most reliable
indicator that the migration is done.

In that case you can avoid issuing a command, but that needs some
thought as well when the queue data is never reloaded. But you can
mark the queue that affinity has changed and let the next operation
on the queue (RX, TX, whatever) which needs to talk to the device
anyway deal with it, i.e. set some command flag in the next operation
which tells the queue to reload that message. The only exception is
CPU hotplug, but I have an idea how to deal with that.

Aside of that, some stuff wants to be synchronous though, e.g.
shutdown and startup.

irq chips have already a mechanism in place to deal with stuff which
cannot be handled from within the irq descriptor spinlock held and
interrupt disabled section.

The mechanism was invented to deal with interrupt chips which are
connected to i2c, spi, etc. The access to an interrupt chip control
register has to queue stuff on the bus and wait for completion.
Obviously not what you can do from interrupt disabled, raw spinlock
held context either.

So we have for most operations (except affinity setting) the concept
of update on lock release. For these devices the interrupt chip which
handles all lines on that controller on the slow bus has an
additional lock, called bus lock. The core code does not know about
that lock at all. It's managed at the irq chip side. The irqchip has
two callbacks: irq_bus_lock() and irq_bus_sync_unlock().
irq_bus_lock() is invoked before interrupts are disabled and the
spinlock is taken, and irq_bus_sync_unlock() after releasing the
spinlock and reenabling interrupts.
The "real" chip operations like mask, unmask etc. operate on a
chip-internal state cache. For such devices irq_bus_lock() usually
takes a sleepable lock (mutex) to protect the state cache and the
update logic over the slow bus. irq_bus_sync_unlock() releases that
lock, but before doing so it checks whether the operation has changed
the state cache, and if so it queues a command on the slow bus and
waits for completion. That makes sure that the device state and the
state cache are in sync before the next operation on a maybe
different irq line on the same chip happens.

Now for your case you might just not have irq_mask()/irq_unmask()
callbacks, or simple ones which just update the queue memory in RAM.
But then you want irq_disable()/irq_enable() callbacks which
manipulate the state cache, and provide the irq_bus_lock() and
irq_bus_sync_unlock() callbacks as well, which do not necessarily
need a lock underneath, but the unlock side implements the 'queue
command and wait for completion' part.

Now coming back to affinity setting. I'd love to avoid adding the bus
lock magic to those interfaces because until now they can be called,
and are called, from atomic contexts. And obviously none of the
devices which use the buslock magic support affinity setting because
they all deliver a single interrupt to a demultiplex interrupt
handler and that one is usually sitting at the CPU level where
interrupt steering works.

If we really can get away with atomically updating the message as
outlined above and just let it happen at some point in the future,
then most problems are solved, except for the nastiness of CPU
hotplug. But that's actually a non-issue. Nothing prevents us from
having an early 'migrate interrupts away from the outgoing CPU'
hotplug state which runs in thread context and can therefore utilize
the buslock mechanism. Actually I was thinking about that for other
reasons already. That state would need some thought and consequently
some minor changes to the affinity mask checks to prevent the
interrupt from being migrated back to the outgoing CPU before that
CPU reaches offline state. Nothing fundamental though.

Just to be clear: We really need to do that at the core level and not
again in some dark place in a driver, as that will cause state
inconsistency and hard to debug wreckage.

>> If interrupt remapping is enabled then both are trivial because then the
>> irq chip can delegate everything to the parent chip, i.e. the remapping
>> unit.
>
> I did like this notion that IRQ remapping could avoid the overhead of
> spin/sleep. Most of the use cases we have for this will require the
> IOMMU anyhow.

You still need to support !remap scenarios I fear. And even for the
remap case you need some of that bus lock magic to handle startup and
teardown properly without the usual horrible hacks in the driver.

If your hard^Wfirmware does the right thing then the only place you
need to worry about the command queueing is startup and teardown and
the extra bit for the early hotplug migration.

Let me summarize what I think would be the sane solution for this:

 1) Utilize atomic writes for either all 16 bytes, or reorder the
    bytes and update 8 bytes atomically, which is sufficient as the
    wide address is only used with irq remapping and the MSI message
    in the device is never changed after startup.

 2) No requirement for issuing a command for regular migration
    operations as they have no requirement to be synchronous.

    Eventually store some state to force a reload on the next regular
    queue operation.
 3) No requirement for issuing a command for mask and unmask
    operations. The core code uses and handles lazy masking already.
    So if the hardware causes the laziness, so be it.

 4) Issue commands for startup and teardown as they need to be
    synchronous.

 5) Have an early migration state for CPU hotunplug which issues a
    command from appropriate context. That would even allow handling
    queue shutdown for managed interrupts when the last CPU in the
    managed affinity set goes down. Restart of such a managed
    interrupt when the first CPU in an affinity set comes online
    again would only need minor modifications of the existing code to
    make it work.

Thoughts?

Thanks,
tglx
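[A minimal sketch of option 2) above -- reordering the entry so that
address_lo and data share a naturally aligned 8 bytes and updating
them with one 64-bit store. It assumes a 64-bit little-endian host;
the type and function names are invented:]

#include <linux/compiler.h>
#include <linux/msi.h>
#include <linux/types.h>

/* Reordered as suggested above: address_lo and data share an aligned u64 */
struct ims_queue_msg {
	u32	address_lo;
	u32	data;
	u32	address_hi;	/* constant unless irq remapping is in use */
	u32	ctrl;
} __aligned(8);

static void ims_queue_write_msg(struct ims_queue_msg *qmsg,
				const struct msi_msg *msg)
{
	/* Little-endian: the low half lands in address_lo, the high in data */
	u64 val = ((u64)msg->data << 32) | msg->address_lo;

	/* One 8-byte store: the device can never observe a torn lo/data pair */
	WRITE_ONCE(*(u64 *)qmsg, val);
	WRITE_ONCE(qmsg->address_hi, msg->address_hi);
}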
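[And a condensed sketch of how the buslock scheme could map onto such
a device. The irq_chip callbacks are the real hooks; struct
ims_chip_data and ims_issue_enable_cmd() are hypothetical stand-ins
for the device-specific parts:]

#include <linux/irq.h>
#include <linux/mutex.h>

/* Hypothetical chip state cache, protected by a sleepable buslock */
struct ims_chip_data {
	struct mutex	buslock;
	bool		enabled;	/* state cache */
	bool		hw_enabled;	/* last state pushed to the device */
};

/* Hypothetical: queue an enable/disable command and wait for completion */
int ims_issue_enable_cmd(struct ims_chip_data *icd, bool enable);

static void ims_irq_bus_lock(struct irq_data *data)
{
	struct ims_chip_data *icd = irq_data_get_irq_chip_data(data);

	mutex_lock(&icd->buslock);
}

static void ims_irq_bus_sync_unlock(struct irq_data *data)
{
	struct ims_chip_data *icd = irq_data_get_irq_chip_data(data);

	/* Sync the state cache to the device outside of atomic context */
	if (icd->enabled != icd->hw_enabled) {
		ims_issue_enable_cmd(icd, icd->enabled);
		icd->hw_enabled = icd->enabled;
	}
	mutex_unlock(&icd->buslock);
}

/* enable/disable only touch the state cache; the unlock side syncs it */
static void ims_irq_enable(struct irq_data *data)
{
	struct ims_chip_data *icd = irq_data_get_irq_chip_data(data);

	icd->enabled = true;
}

static void ims_irq_disable(struct irq_data *data)
{
	struct ims_chip_data *icd = irq_data_get_irq_chip_data(data);

	icd->enabled = false;
}

static const struct irq_chip ims_cmdq_controller = {
	.name			= "IMS-CMDQ",
	.irq_enable		= ims_irq_enable,
	.irq_disable		= ims_irq_disable,
	.irq_bus_lock		= ims_irq_bus_lock,
	.irq_bus_sync_unlock	= ims_irq_bus_sync_unlock,
};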
On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> >> So if I understand correctly then the queue memory where the MSI
> >> descriptor sits is in RAM.
> >
> > Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> > could have enough on-die memory then they could just use really big
> > MSI-X tables. Currently due to on-die memory constraints mlx5 is
> > limited to a few hundred MSI-X vectors.
>
> Right, that's the limit of a particular device, but nothing prevents
> you from having a larger table on a new device.

Well, physics is a problem. The SRAM to store the MSI vectors costs
die space and making the chip die larger is not an option. So the
question is what do you throw out of the chip to get a 10-20x
increase in MSI SRAM?

This is why using host memory is so appealing. It is
economically/functionally better.

I'm going to guess other HW is in the same situation; virtualization
is really pushing up the number of required IRQs.

> >> How is that supposed to work if interrupt remapping is disabled?
> >
> > The best we can do is issue a command to the device and spin/sleep
> > until completion. The device will serialize everything internally.
> >
> > If the device has died the driver has code to detect and trigger a
> > PCI function reset which will definitely stop the interrupt.
>
> If that interrupt has gone into storm mode for some reason then this
> will render your machine unusable before you can do that.

Yes, but in general the HW design is to have one-shot interrupts; it
would have to be well off the rails to storm. The kind of off the
rails where it could also be doing crazy stuff on PCI-E that would be
very harmful.

> > So, the implementation of these functions would be to push any change
> > onto a command queue, trigger the device to DMA the command, spin/sleep
> > until the device returns a response and then continue on. If the
> > device doesn't return a response in a time window then trigger a WQ to
> > do a full device reset.
>
> I really don't want to do that with the irq descriptor lock held, or
> in the affinity case from the interrupt handler, as we have to do
> with PCI MSI/MSI-X due to the horrors of the X86 interrupt delivery
> trainwreck. Also you cannot call into command queue code from
> interrupt disabled and interrupt descriptor lock held sections. You
> can try, but lockdep will yell at you immediately.

Yes, I wouldn't want to do this from an IRQ.

> One question is whether the device can see partial updates to that
> memory due to the async 'swap' of context from the device CPU.

It is worse than just partial updates. The device operation is much
more like you'd imagine a CPU cache. There could be copies of the RAM
in the device for long periods of time, dirty data in the device that
will flush back to CPU RAM overwriting CPU changes, etc.

Without involving the device there is just no way to create data
consistency, and no way to change the data from the CPU.

This is the down side of having device data in the RAM. It cannot be
as simple as 'just fetch it every time before you use it' as
performance would be horrible.

> irq chips have already a mechanism in place to deal with stuff which
> cannot be handled from within the irq descriptor spinlock held and
> interrupt disabled section.
> The mechanism was invented to deal with interrupt chips which are
> connected to i2c, spi, etc. The access to an interrupt chip control
> register has to queue stuff on the bus and wait for completion.
> Obviously not what you can do from interrupt disabled, raw spinlock
> held context either.

Ah interesting, sounds like the right parts! I didn't know about
this..

> Now coming back to affinity setting. I'd love to avoid adding the bus
> lock magic to those interfaces because until now they can be called,
> and are called, from atomic contexts. And obviously none of the
> devices which use the buslock magic support affinity setting because
> they all deliver a single interrupt to a demultiplex interrupt
> handler and that one is usually sitting at the CPU level where
> interrupt steering works.
>
> If we really can get away with atomically updating the message as
> outlined above and just let it happen at some point in the future,
> then most problems are solved, except for the nastiness of CPU
> hotplug.

Since we can't avoid a device command, I'm thinking more along the
lines of having the affinity update trigger an async WQ to issue the
command from a thread context. Since it doesn't need to be
synchronous it can make it out 'eventually'.

I suppose the core code could provide this as a service? Sort of a
variant of the other lazy things above?

> But that's actually a non-issue. Nothing prevents us from having an
> early 'migrate interrupts away from the outgoing CPU' hotplug state
> which runs in thread context and can therefore utilize the buslock
> mechanism. Actually I was thinking about that for other reasons
> already.

That would certainly work well, seems like it fits with the other
lazy/sleeping stuff above as well.

> >> If interrupt remapping is enabled then both are trivial because then the
> >> irq chip can delegate everything to the parent chip, i.e. the remapping
> >> unit.
> >
> > I did like this notion that IRQ remapping could avoid the overhead of
> > spin/sleep. Most of the use cases we have for this will require the
> > IOMMU anyhow.
>
> You still need to support !remap scenarios I fear.

For x86 I think we could accept linking this to the IOMMU, if really
necessary. But it would have to work with ARM - is remapping an
x86-only thing? Does ARM put the affinity in the GIC tables, not in
the MSI data?

> Let me summarize what I think would be the sane solution for this:
>
> 1) Utilize atomic writes for either all 16 bytes, or reorder the
>    bytes and update 8 bytes atomically, which is sufficient as the
>    wide address is only used with irq remapping and the MSI message
>    in the device is never changed after startup.

Sadly not something the device can manage due to data coherence.

> 2) No requirement for issuing a command for regular migration
>    operations as they have no requirement to be synchronous.
>
>    Eventually store some state to force a reload on the next regular
>    queue operation.

Would the async version above be OK?

> 3) No requirement for issuing a command for mask and unmask
>    operations. The core code uses and handles lazy masking already.
>    So if the hardware causes the laziness, so be it.

This lazy masking thing sounds good, I'm totally unfamiliar with it
though.

> 4) Issue commands for startup and teardown as they need to be
>    synchronous.

Yep

> 5) Have an early migration state for CPU hotunplug which issues a
>    command from appropriate context. That would even allow handling
>    queue shutdown for managed interrupts when the last CPU in the
>    managed affinity set goes down.
>    Restart of such a managed interrupt when the first CPU in an
>    affinity set comes online again would only need minor
>    modifications of the existing code to make it work.

Yep

> Thoughts?

This email is super helpful, I definitely don't know all these
corners of the IRQ subsystem as my past with it has mostly been SOC
stuff that isn't as complicated!

Thanks,
Jason
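[A sketch of the async-WQ idea at the irq chip level -- cache the
freshly composed message and let a worker push it to the device.
struct ims_queue and ims_issue_set_msi_cmd() are hypothetical; the
irq chip hook and the workqueue calls are existing kernel API:]

#include <linux/irq.h>
#include <linux/msi.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Hypothetical per-queue state; the real layout is device specific */
struct ims_queue {
	spinlock_t		lock;		/* protects pending_msg */
	struct msi_msg		pending_msg;
	struct work_struct	msi_sync;
};

/* Hypothetical: issue the 'set MSI message' device command and wait */
int ims_issue_set_msi_cmd(struct ims_queue *q, const struct msi_msg *msg);

/* Called from atomic context: just cache the message and kick the WQ */
static void ims_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	struct ims_queue *q = irq_data_get_irq_chip_data(data);
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	q->pending_msg = *msg;
	spin_unlock_irqrestore(&q->lock, flags);

	schedule_work(&q->msi_sync);
}

/* Thread context: safe to sleep on the command queue here */
static void ims_msi_sync_work(struct work_struct *work)
{
	struct ims_queue *q = container_of(work, struct ims_queue, msi_sync);
	struct msi_msg msg;
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	msg = q->pending_msg;
	spin_unlock_irqrestore(&q->lock, flags);

	ims_issue_set_msi_cmd(q, &msg);
}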
Jason,

On Fri, Aug 21 2020 at 21:51, Jason Gunthorpe wrote:
> On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
>> > If the device has died the driver has code to detect and trigger a
>> > PCI function reset which will definitely stop the interrupt.
>>
>> If that interrupt has gone into storm mode for some reason then this
>> will render your machine unusable before you can do that.
>
> Yes, but in general the HW design is to have one-shot interrupts; it
> would have to be well off the rails to storm. The kind of off the
> rails where it could also be doing crazy stuff on PCI-E that would be
> very harmful.

Yeah. One shot should prevent most of the wreckage. I just wanted to
spell it out.

>> One question is whether the device can see partial updates to that
>> memory due to the async 'swap' of context from the device CPU.
>
> It is worse than just partial updates. The device operation is much
> more like you'd imagine a CPU cache. There could be copies of the RAM
> in the device for long periods of time, dirty data in the device that
> will flush back to CPU RAM overwriting CPU changes, etc.

TBH, that's insane. You clearly want to think about this some more.

If you swap out device state and device control state then you
definitely want to have regions which are read-only from the device
POV and never written back. The MSI msg store clearly belongs into
that category. But that's not restricted to the MSI msg store; there
is certainly other stuff which never wants to be written back by the
device.

If you don't do that then you simply can't write to that space from
the CPU and you have to transport this kind of information always via
command queues. But that does not make sense. It's trivial enough to
have

	| RO state |
	| RW state |

and on swap-in the whole thing is DMA'd into the device and on
swap-out only the RW state part. It's not rocket science and makes a
huge amount of sense.

> Without involving the device there is just no way to create data
> consistency, and no way to change the data from the CPU.
>
> This is the down side of having device data in the RAM. It cannot be
> as simple as 'just fetch it every time before you use it' as
> performance would be horrible.

That's clear, but with a proper separation like the above and some
extra mechanism which allows you to tickle a reload of 'RO state' you
can avoid quite some of the problems which you create otherwise.

>> If we really can get away with atomically updating the message as
>> outlined above and just let it happen at some point in the future,
>> then most problems are solved, except for the nastiness of CPU
>> hotplug.
>
> Since we can't avoid a device command, I'm thinking more along the
> lines of having the affinity update trigger an async WQ to issue the
> command from a thread context. Since it doesn't need to be
> synchronous it can make it out 'eventually'.
>
> I suppose the core code could provide this as a service? Sort of a
> variant of the other lazy things above?

Kinda. That needs a lot of thought for the affinity setting stuff
because it can be called from contexts which do not allow that. It's
solvable though, but I clearly need to stare at the corner cases for
a while.

> But it would have to work with ARM - is remapping an x86-only thing?

No. ARM64 has that as well.

> Does ARM put the affinity in the GIC tables, not in the MSI data?

IIRC, yes.
>> Let me summarize what I think would be the sane solution for this:
>>
>> 1) Utilize atomic writes for either all 16 bytes, or reorder the
>>    bytes and update 8 bytes atomically, which is sufficient as the
>>    wide address is only used with irq remapping and the MSI message
>>    in the device is never changed after startup.
>
> Sadly not something the device can manage due to data coherence.

I disagree :)

>> 2) No requirement for issuing a command for regular migration
>>    operations as they have no requirement to be synchronous.
>>
>>    Eventually store some state to force a reload on the next regular
>>    queue operation.
>
> Would the async version above be OK?

Async is fine in any variant (except for hotplug). Though having an
async WQ or whatever there needs some thought.

>> 3) No requirement for issuing a command for mask and unmask
>>    operations. The core code uses and handles lazy masking already.
>>    So if the hardware causes the laziness, so be it.
>
> This lazy masking thing sounds good, I'm totally unfamiliar with it
> though.

It's used to avoid irq chip (often MMIO) access in scenarios where
disable/enable of an interrupt line happens with high frequency.
Serial has that issue. So we mark the interrupt disabled, but do not
mask it, and the core can handle that and masks it if an interrupt
comes in while it is in that disabled state.

That obviously does not work out of the box to protect against the
not-disabled-but-masked state, but conceptually it's a similar
problem and can be made to work without massive changes. OTOH, in
normal operation for MSI interrupts (edge type) masking is not used
at all and is just restricted to startup and teardown. But I clearly
need to think about it some more with a more awake brain.

> This email is super helpful, I definitely don't know all these
> corners of the IRQ subsystem as my past with it has mostly been SOC
> stuff that isn't as complicated!

It's differently complicated and not less horrible :)

Thanks,
tglx
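[For illustration, a minimal sketch of the RO/RW split described
above in a swappable queue context. All names are invented; the point
is only that the device writes back the rw part on swap-out and never
the ro part:]

#include <linux/types.h>

/*
 * Hypothetical swappable queue context. On swap-in the device DMAs
 * the whole structure; on swap-out it writes back only 'rw', so the
 * CPU-owned MSI message can never be clobbered by stale device data.
 */
struct ims_queue_ctx {
	/* Device read-only: written by the CPU, never written back */
	struct {
		u32	msi_address_lo;
		u32	msi_data;
		u32	msi_address_hi;
		u32	msi_ctrl;
	} ro;

	/* Device read-write: flushed back to RAM on swap-out */
	struct {
		u32	head;
		u32	tail;
		u32	state;
	} rw;
};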
On Sat, Aug 22, 2020 at 03:34:45AM +0200, Thomas Gleixner wrote:
> >> One question is whether the device can see partial updates to that
> >> memory due to the async 'swap' of context from the device CPU.
> >
> > It is worse than just partial updates. The device operation is much
> > more like you'd imagine a CPU cache. There could be copies of the RAM
> > in the device for long periods of time, dirty data in the device that
> > will flush back to CPU RAM overwriting CPU changes, etc.
>
> TBH, that's insane. You clearly want to think about this some more.

I think this general design is around 15 years old, across a healthy
number of silicon generations, and rather a large number of shipped
devices. People have thought about it :)

> If you swap out device state and device control state then you
> definitely want to have regions which are read-only from the device
> POV and never written back.

It is not as useful as you'd think - the issue with atomicity of
update still largely prevents doing much useful from the CPU, and to
make any CPU side changes visible a device command would still be
needed to synchronize the internal state to that modified memory.

So, CPU centric updates would cover a very limited number of
operations, and a device command is required anyhow. Little is
actually gained.

> The MSI msg store clearly belongs into that category. But that's not
> restricted to the MSI msg store; there is certainly other stuff
> which never wants to be written back by the device.

To get a design where you'd be able to run everything from a CPU
atomic context that can't trigger a WQ, new silicon would have to
implement some MSI-only 'cache' that can invalidate entries based on
a simple MemWr TLP. Then the affinity update would write to the host
memory and send a MemWr to the device to trigger the invalidate.

As a silicon design it might work, but it means existing devices
can't be used with this dev_msi. It is also the sort of thing that
would need a standard document to have any hope of multiple vendors
fitting into it, e.g. at PCI-SIG or something.

> If you don't do that then you simply can't write to that space from
> the CPU and you have to transport this kind of information always via
> command queues.

Yes, exactly. This is part of the architectural design of the device,
and has been for a long time. It has positives and negatives.

> > I suppose the core code could provide this as a service? Sort of a
> > variant of the other lazy things above?
>
> Kinda. That needs a lot of thought for the affinity setting stuff
> because it can be called from contexts which do not allow that. It's
> solvable though, but I clearly need to stare at the corner cases for
> a while.

If possible, this would be ideal, as we could use the dev_msi on a
big installed base of existing HW. I suspect other HW can probably
fit into this too, as the basic ingredients should be fairly
widespread.

Even a restricted version for situations where affinity does not need
a device update would possibly be interesting (e.g. x86 IOMMU remap,
ARM GIC, etc).

> OTOH, in normal operation for MSI interrupts (edge type) masking is
> not used at all and is just restricted to startup and teardown.

Yeah, at least this device doesn't need masking at runtime, just
startup/teardown and affinity update.

Thanks,
Jason
On Sat, Aug 22 2020 at 20:05, Jason Gunthorpe wrote:
> On Sat, Aug 22, 2020 at 03:34:45AM +0200, Thomas Gleixner wrote:
> As a silicon design it might work, but it means existing devices
> can't be used with this dev_msi. It is also the sort of thing that
> would need a standard document to have any hope of multiple vendors
> fitting into it, e.g. at PCI-SIG or something.

Fair enough.

>> If you don't do that then you simply can't write to that space from
>> the CPU and you have to transport this kind of information always via
>> command queues.
>
> Yes, exactly. This is part of the architectural design of the device,
> and has been for a long time. It has positives and negatives.

As always, and it clearly follows the general HW design rule "we can
fix that in software".

>> > I suppose the core code could provide this as a service? Sort of a
>> > variant of the other lazy things above?
>>
>> Kinda. That needs a lot of thought for the affinity setting stuff
>> because it can be called from contexts which do not allow that. It's
>> solvable though, but I clearly need to stare at the corner cases for
>> a while.
>
> If possible, this would be ideal, as we could use the dev_msi on a
> big installed base of existing HW.

I'll have a look, but I'm surely not going to like the outcome.

Thanks,
tglx
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -571,4 +571,12 @@ config LOONGSON_PCH_MSI
 	help
 	  Support for the Loongson PCH MSI Controller.
 
+config IMS_MSI
+	bool "IMS Interrupt Message Store MSI controller"
+	depends on PCI
+	select DEVICE_MSI
+	help
+	  Support for IMS Interrupt Message Store MSI controller
+	  with IMS slot storage in a slot array
+
 endmenu
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -111,3 +111,4 @@ obj-$(CONFIG_LOONGSON_HTPIC)		+= irq-loo
 obj-$(CONFIG_LOONGSON_HTVEC)		+= irq-loongson-htvec.o
 obj-$(CONFIG_LOONGSON_PCH_PIC)		+= irq-loongson-pch-pic.o
 obj-$(CONFIG_LOONGSON_PCH_MSI)		+= irq-loongson-pch-msi.o
+obj-$(CONFIG_IMS_MSI)			+= irq-ims-msi.o
--- /dev/null
+++ b/drivers/irqchip/irq-ims-msi.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+// (C) Copyright 2020 Thomas Gleixner <tglx@linutronix.de>
+/*
+ * Shared interrupt chip and irq domain for Intel IMS devices
+ */
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/msi.h>
+#include <linux/irq.h>
+
+#include <linux/irqchip/irq-ims-msi.h>
+
+struct ims_data {
+	struct ims_array_info	info;
+	unsigned long		map[0];
+};
+
+static void ims_mask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 __iomem *ctrl = &slot->ctrl;
+
+	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
+}
+
+static void ims_unmask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 __iomem *ctrl = &slot->ctrl;
+
+	iowrite32(ioread32(ctrl) | IMS_VECTOR_CTRL_UNMASK, ctrl);
+}
+
+static void ims_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
+
+	iowrite32(msg->address_lo, &slot->address_lo);
+	iowrite32(msg->address_hi, &slot->address_hi);
+	iowrite32(msg->data, &slot->data);
+}
+
+static const struct irq_chip ims_msi_controller = {
+	.name			= "IMS",
+	.irq_mask		= ims_mask_irq,
+	.irq_unmask		= ims_unmask_irq,
+	.irq_write_msi_msg	= ims_write_msi_msg,
+	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.flags			= IRQCHIP_SKIP_SET_WAKE,
+};
+
+static void ims_reset_slot(struct ims_array_slot __iomem *slot)
+{
+	iowrite32(0, &slot->address_lo);
+	iowrite32(0, &slot->address_hi);
+	iowrite32(0, &slot->data);
+	iowrite32(0, &slot->ctrl);
+}
+
+static void ims_free_msi_store(struct irq_domain *domain, struct device *dev)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_data *ims = info->data;
+	struct msi_desc *entry;
+
+	for_each_msi_entry(entry, dev) {
+		if (entry->device_msi.priv_iomem) {
+			clear_bit(entry->device_msi.hwirq, ims->map);
+			ims_reset_slot(entry->device_msi.priv_iomem);
+			entry->device_msi.priv_iomem = NULL;
+			entry->device_msi.hwirq = 0;
+		}
+	}
+}
+
+static int ims_alloc_msi_store(struct irq_domain *domain, struct device *dev,
+			       int nvec)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_data *ims = info->data;
+	struct msi_desc *entry;
+
+	for_each_msi_entry(entry, dev) {
+		unsigned int idx;
+
+		idx = find_first_zero_bit(ims->map, ims->info.max_slots);
+		if (idx >= ims->info.max_slots)
+			goto fail;
+		set_bit(idx, ims->map);
+		entry->device_msi.priv_iomem = &ims->info.slots[idx];
+		entry->device_msi.hwirq = idx;
+	}
+	return 0;
+
+fail:
+	ims_free_msi_store(domain, dev);
+	return -ENOSPC;
+}
+
+struct ims_domain_template {
+	struct msi_domain_ops	ops;
+	struct msi_domain_info	info;
+};
+
+static const struct ims_domain_template ims_domain_template = {
+	.ops = {
+		.msi_alloc_store	= ims_alloc_msi_store,
+		.msi_free_store		= ims_free_msi_store,
+	},
+	.info = {
+		.flags		= MSI_FLAG_USE_DEF_DOM_OPS |
+				  MSI_FLAG_USE_DEF_CHIP_OPS,
+		.handler	= handle_edge_irq,
+		.handler_name	= "edge",
+	},
+};
+
+struct irq_domain *
+pci_ims_create_msi_irq_domain(struct pci_dev *pdev,
+			      struct ims_array_info *ims_info)
+{
+	struct ims_domain_template *info;
+	struct irq_domain *domain;
+	struct irq_chip *chip;
+	struct ims_data *data;
+	unsigned int size;
+
+	/* Allocate new domain storage */
+	info = kmemdup(&ims_domain_template, sizeof(ims_domain_template),
+		       GFP_KERNEL);
+	if (!info)
+		return NULL;
+	/* Link the ops */
+	info->info.ops = &info->ops;
+
+	/* Allocate ims_info along with the bitmap */
+	size = sizeof(*data);
+	size += BITS_TO_LONGS(ims_info->max_slots) * sizeof(unsigned long);
+	data = kzalloc(size, GFP_KERNEL);
+	if (!data)
+		goto err_info;
+
+	data->info = *ims_info;
+	info->info.data = data;
+
+	chip = kmemdup(&ims_msi_controller, sizeof(ims_msi_controller),
+		       GFP_KERNEL);
+	if (!chip)
+		goto err_data;
+	info->info.chip = chip;
+
+	domain = pci_subdevice_msi_create_irq_domain(pdev, &info->info);
+	if (!domain)
+		goto err_chip;
+
+	return domain;
+
+err_chip:
+	kfree(chip);
+err_data:
+	kfree(data);
+err_info:
+	kfree(info);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_ims_create_msi_irq_domain);
--- /dev/null
+++ b/include/linux/irqchip/irq-ims-msi.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Copyright 2020 Thomas Gleixner <tglx@linutronix.de> */
+
+#ifndef _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+#define _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+
+#include <linux/types.h>
+
+struct ims_array_slot {
+	u32	address_lo;
+	u32	address_hi;
+	u32	data;
+	u32	ctrl;
+};
+
+/* Bit to unmask the interrupt in slot->ctrl */
+#define IMS_VECTOR_CTRL_UNMASK	0x01
+
+struct ims_array_info {
+	struct ims_array_slot	__iomem *slots;
+	unsigned int		max_slots;
+};
+
+/* Dummy forward declaration for illustration */
+struct ims_queue_slot;
+
+/**
+ * ims_msi_store - Interrupt Message Store descriptor data
+ * @array_slot:	Pointer to an on-device IMS storage array slot
+ * @queue_slot:	Pointer to storage embedded in queue data
+ * @hw_irq:	Index of the slot or queue. Also hardware irq number
+ */
+struct ims_msi_store {
+	union {
+		struct ims_array_slot __iomem *array_slot;
+		struct ims_queue_slot *queue_slot;
+	};
+	unsigned int hw_irq;
+};
+
+#endif
A generic IMS irq chip and irq domain implementation for IMS based
devices which utilize a MSI message store array on chip.

Allows IMS devices with a MSI message store array to reuse this code
for different array sizes.

Allocation and freeing of interrupts happens via the generic
msi_domain_alloc/free_irqs() interface. No special purpose IMS magic
required as long as the interrupt domain is stored in the underlying
device struct.

Completely untested of course and mostly for illustration and
educational purpose. This should of course be a modular irq chip, but
adding that support is left as an exercise for the people who care
about this deeply.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Megha Dey <megha.dey@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: Baolu Lu <baolu.lu@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 drivers/irqchip/Kconfig             |   8 +
 drivers/irqchip/Makefile            |   1 
 drivers/irqchip/irq-ims-msi.c       | 169 ++++++++++++++++++++++++++++++++++++
 include/linux/irqchip/irq-ims-msi.h |  41 ++++++++
 4 files changed, 219 insertions(+)
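[To illustrate the "no special purpose IMS magic" point, a
hypothetical driver-side usage sketch against the interfaces this
series proposes. dev_set_msi_domain() is existing kernel API;
msi_domain_alloc_irqs() is the generic allocation interface and
pci_ims_create_msi_irq_domain() is the RFC function from this patch;
the surrounding driver function is invented:]

#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/irqchip/irq-ims-msi.h>

/* Hypothetical driver init path; error unwinding trimmed for brevity */
static int my_driver_setup_ims(struct pci_dev *pdev,
			       struct ims_array_slot __iomem *slots,
			       unsigned int nslots, unsigned int nvec)
{
	struct ims_array_info info = {
		.slots		= slots,
		.max_slots	= nslots,
	};
	struct irq_domain *domain;

	domain = pci_ims_create_msi_irq_domain(pdev, &info);
	if (!domain)
		return -ENOMEM;

	/* Store the domain in the device so the generic MSI code finds it */
	dev_set_msi_domain(&pdev->dev, domain);

	/* Plain msi_domain_alloc_irqs(); no IMS specific allocation call */
	return msi_domain_alloc_irqs(domain, &pdev->dev, nvec);
}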