diff mbox series

[RFC,38/38] irqchip: Add IMS array driver - NOT FOR MERGING

Message ID 20200821002949.049867339@linutronix.de (mailing list archive)
State Superseded
Headers show
Series x86, PCI, XEN, genirq ...: Prepare for device MSI | expand

Commit Message

Thomas Gleixner Aug. 21, 2020, 12:25 a.m. UTC
A generic IMS irq chip and irq domain implementation for IMS based devices
which utilize a MSI message store array on chip.

Allows IMS devices with a MSI message store array to reuse this code for
different array sizes.

Allocation and freeing of interrupts happens via the generic
msi_domain_alloc/free_irqs() interface. No special purpose IMS magic
required as long as the interrupt domain is stored in the underlying device
struct.

Completely untested of course and mostly for illustration and educational
purpose. This should of course be a modular irq chip, but adding that
support is left as an exercise for the people who care about this deeply.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Megha Dey <megha.dey@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: Baolu Lu <baolu.lu@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 drivers/irqchip/Kconfig             |    8 +
 drivers/irqchip/Makefile            |    1 
 drivers/irqchip/irq-ims-msi.c       |  169 ++++++++++++++++++++++++++++++++++++
 include/linux/irqchip/irq-ims-msi.h |   41 ++++++++
 4 files changed, 219 insertions(+)

Comments

Jason Gunthorpe Aug. 21, 2020, 12:45 p.m. UTC | #1
On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> +static void ims_mask_irq(struct irq_data *data)
> +{
> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> +	u32 __iomem *ctrl = &slot->ctrl;
> +
> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);

Just to be clear, this is exactly the sort of operation we can't do
with non-MSI interrupts. For a real PCI device to execute this it
would have to keep the data on die.

I saw the idxd driver was doing something like this, I assume it
avoids trouble because it is a fake PCI device integrated with the
CPU, not on a real PCI bus?

It is really nice to see irq_domain used properly in x86!

Thanks,
Jason
Thomas Gleixner Aug. 21, 2020, 7:47 p.m. UTC | #2
On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
>> +static void ims_mask_irq(struct irq_data *data)
>> +{
>> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
>> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
>> +	u32 __iomem *ctrl = &slot->ctrl;
>> +
>> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
>
> Just to be clear, this is exactly the sort of operation we can't do
> with non-MSI interrupts. For a real PCI device to execute this it
> would have to keep the data on die.

We means NVIDIA and your new device, right?

So if I understand correctly then the queue memory where the MSI
descriptor sits is in RAM.

How is that supposed to work if interrupt remapping is disabled?

That means irq migration and proper disabling of an interrupt become an
interesting exercise. I'm so not looking forward to that.

If interrupt remapping is enabled then both are trivial because then the
irq chip can delegate everything to the parent chip, i.e. the remapping
unit.

Can you please explain that a bit more precise?

> I saw the idxd driver was doing something like this, I assume it
> avoids trouble because it is a fake PCI device integrated with the
> CPU, not on a real PCI bus?

That's how it is implemented as far as I understood the patches. It's
device memory therefore iowrite32().

> It is really nice to see irq_domain used properly in x86!

If you ignore the abuse in XEN :)

And to be fair proper and usable (hierarchical) irq domains originate
from x86 and happened to solve quite a few horrorshows on the ARM side.

Just back then when we converted the original maze, nobody had a good
idea and the stomach to touch XEN.

Thanks,

        tglx
Jason Gunthorpe Aug. 21, 2020, 8:17 p.m. UTC | #3
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +	u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
> 
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, eg the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update is a
command/response operation not a MMIO operation.

The HW is already swapping the queues causing the interrupts to RAM,
so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense, a 'working set' for the NIC device in some cases
can be hundreds of megabytes of data. System RAM is used to store
this, and precious on-die memory holds some dynamic active set, much
like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.

> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/spleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
> 
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need I
assumed the idxd HW had some hidden swapping to RAM. 

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO triggered swapping work that are not
available to a true PCI-E device.

Jason
Thomas Gleixner Aug. 21, 2020, 11:47 p.m. UTC | #4
On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
>> So if I understand correctly then the queue memory where the MSI
>> descriptor sits is in RAM.
>
> Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> could have enough on-die memory then they could just use really big
> MSI-X tables. Currently due to on-die memory constraints mlx5 is
> limited to a few hundred MSI-X vectors.

Right, that's the limit of a particular device, but nothing prevents you
to have a larger table on a new device.

The MSI-X limitation to 2048 is defined by the PCI spec and you'd need
either some non spec compliant abuse of the reserved size bits or some
extra config entry. So IMS is a way to work around that. But I
understand why you want to move them to main memory, but you have to
deal with the problems this creates upfront.

> Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
> RAM.
>
> Moving away from MSI-X's MMIO access model allows them to be swapped
> to RAM. The cost is that accessing them for update is a
> command/response operation not a MMIO operation.
>
> The HW is already swapping the queues causing the interrupts to RAM,
> so adding a bit of additional data to store the MSI addr/data is
> reasonable.

Makes sense.

>> How is that supposed to work if interrupt remapping is disabled?
>
> The best we can do is issue a command to the device and spin/sleep
> until completion. The device will serialize everything internally.
>
> If the device has died the driver has code to detect and trigger a
> PCI function reset which will definitely stop the interrupt.

If that interrupt is gone into storm mode for some reason then this will
render your machine unusable before you can do that.

> So, the implementation of these functions would be to push any change
> onto a command queue, trigger the device to DMA the command, spin/sleep
> until the device returns a response and then continue on. If the
> device doesn't return a response in a time window then trigger a WQ to
> do a full device reset.

I really don't want to do that with the irq descriptor lock held or in
case of affinity from the interrupt handler as we have to do with PCI
MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck.
Also you cannot call into command queue code from interrupt disabled and
interrupt descriptor lock held sections. You can try, but lockdep will
yell at you immediately. 

There is also CPU hotplug where we have to force migrate an interrupt
away from an outgoing CPU. This needs some serious thought.

One question is whether the device can see partial updates to that
memory due to the async 'swap' of context from the device CPU.

So we have address_lo, address_hi, data and ctrl. Each of them 32 bit.

address_hi is only relevant when the number of CPUs is > 255 which
requires X2APIC which in turn requires interrupt remapping. For all
others the address_hi value never changes. Let's ignore that case for
now, but see further down.

So what's interesting is address_lo and data. If the device sees an
partial update, i.e. address_lo is written and the device grabs the
update before data is written then the next interrupt will end up in
lala land. We have code for that in place in msi_set_affinity() in
arch/x86/kernel/apic/msi.c. Get eyecancer protection glasses before
opening that and keep beer ready to wipe out the horrors immediately
afterwards.

If the device updates the data only when a command is issued then this
is not a problem, but it causes other problems because you still cannot
access the command queue from that context. This makes it even worse for
the CPU hotplug case. But see all of the reasoning on that.

If it takes whatever it sees while grabbing the state when switching to
a different queue or at the point of actual interrupt delivery, then you
have a problem. Not only you, I'm going to have one as well because I'm
going to be the poor sod to come up with the workaround.

So we better address that _before_ you start deploying this piece of
art. I'm not really interested in another slighly different and probably
more horrible version of the same story. Don't blame me, it's the way
how Intel decided to make this "work".

There are a couple of options to ensure that the device will never see
inconsistent state:

      1) Use a locked 16 byte wide operation (cpmxchg16) which is not
         available on 32bit

      2) Order the MSG entry differently in the queue storage:

         u32 address_lo
         u32 data
         u32 address_hi
         u32 ctrl

         And then enforce an 8 byte store on 64 bit which is guaranteed
         to be atomic vs. other CPUs and bus agents, i.e. DMA.

         I said enforce because compilers are known to do stupid things.

Both are fine for me and the only caveat is that the access does not go
accross a cache line boundary. The restriction to 64bit shouldn't be a
problem either. Running such a device on 32bit causes more problems than
it solves :)

> The spin/sleep is only needed if the update has to be synchronous, so
> things like rebalancing could just push the rebalancing work and
> immediately return.

Interrupt migration is async anyway. An interrupt might have been sent
to the old vector just before the new vector was written. That's already
dealt with. The old vector is cleaned up when the first interrupt
arrives on the new vector which is the most reliable indicator that it's
done.

In that case you can avoid issuing a command, but that needs some
thought as well when the queue data is never reloaded. But you can mark
the queue that affinity has changed and let the next operation on the
queue (RX, TX, whatever) which needs to talk to the device anyway deal
with it, i.e. set some command flag in the next operation which tells
the queue to reload that message.

The only exception is CPU hotplug, but I have an idea how to deal with
that.

Aside of that some stuff want's to be synchronous though. e.g. shutdown,
startup.

irq chips have already a mechanism in place to deal with stuff which
cannot be handled from within the irq descriptor spinlock held and
interrupt disabled section.

The mechanism was invented to deal with interrupt chips which are
connected to i2c, spi, etc.. The access to an interrupt chip control
register has to queue stuff on the bus and wait for completion.
Obviously not what you can do from interrupt disabled, raw spinlock held
context either.

So we have for most operations (except affinity setting) the concept of
update on lock release. For these devices the interrupt chip which
handles all lines on that controller on the slow bus has an additional
lock, called bus lock. The core code does not know about that lock at
all. It's managed at the irq chip side.

The irqchip has two callbacks: irq_bus_lock() and irq_bus_sync_unlock().
irq_bus_lock() is invoked before interrupts are disabled and the
spinlock is taken and irq_bus_sync_unlock() after releasing the spinlock
and reenabling interrupts. The "real" chip operations like mask, unmask
etc. are operating on an chip internal state cache.

For such devices irq_bus_lock() usually takes a sleepable lock (mutex)
to protect the state cache and the update logic over the slow bus.

irq_bus_sync_unlock() releases that lock, but before doing so it checks
whether the operation has changed the state cache and if so it queues a
command on the slow bus and waits for completion.

That makes sure that the device state and the state cache are in sync
before the next operation on a maybe different irq line on the same chip
happens.

Now for your case you might just not have irq_mask()/irq_unmask() callbacks or
simple ones which just update the queue memory in RAM, but then you want
irq_disable()/irq_enable() callbacks which manipulate state cache and
then provide the irq_bus_lock() and irq_bus_sync_unlock() callbacks as
well which do not necessarily need a lock underneath, but the unlock
side implements the 'Queue command and wait for completion' part.

Now coming back to affinity setting. I'd love to avoid adding the bus
lock magic to those interfaces because until now they can be called and
are called from atomic contexts. And obviously none of the devices which
use the buslock magic support affinity setting because they all deliver
a single interrupt to a demultiplex interrupt and that one is usually
sitting at the CPU level where interrupt steering works.

If we really can get away with atomically updating the message as
outlined above and just let it happen at some point in the future then
most problems are solved, except for the nastyness of CPU hotplug.

But that's actually a non issue. Nothing prevents us from having an
early 'migrate interrupts away from the outgoing CPU hotplug state'
which runs in thread context and can therefore utilize the buslock
mechanism. Actually I was thinking about that for other reasons already.

That state would need some thought and consequently some minor changes
to the affinity mask checks to prevent that the interrupt gets migrated
back to the outgoing CPU before that CPU reaches offline state. Nothing
fundamental though.

Just to be clear: We really need to do that at the core level and not
again in some dark place in a driver as that will cause state
inconsistency and hard to debug wreckage.

>> If interrupt remapping is enabled then both are trivial because then the
>> irq chip can delegate everything to the parent chip, i.e. the remapping
>> unit.
>
> I did like this notion that IRQ remapping could avoid the overhead of
> spin/spleep. Most of the use cases we have for this will require the
> IOMMU anyhow.

You still need to support !remap scenarios I fear.

And even for the remap case you need some of that bus lock magic to
handle startup and teardown properly without the usual horrible hacks in
the driver.

If your hard^Wfirmware does the right thing then the only place you need
to worry about the command queueing is startup and teardown and the
extra bit for the early hotplug migration.

Let me summarize what I think would be the sane solution for this:

  1) Utilize atomic writes for either all 16 bytes or reorder the bytes
     and update 8 bytes atomically which is sufficient as the wide
     address is only used with irq remapping and the MSI message in the
     device is never changed after startup.

  2) No requirement for issuing a command for regular migration
     operations as they have no requirements to be synchronous.

     Eventually store some state to force a reload on the next regular
     queue operation.

  3) No requirement for issuing a command for mask and unmask operations.
     The core code uses and handles lazy masking already. So if the
     hardware causes the lazyness, so be it.

  4) Issue commands for startup and teardown as they need to be
     synchronous

  5) Have an early migration state for CPU hotunplug which issues a
     command from appropriate context. That would even allow to handle
     queue shutdown for managed interrupts when the last CPU in the
     managed affinity set goes down. Restart of such a managed interrupt
     when the first CPU in an affinity set comes online again would only
     need minor modifications of the existing code to make it work.
     
Thoughts?

Thanks,

        tglx
Jason Gunthorpe Aug. 22, 2020, 12:51 a.m. UTC | #5
On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> >> So if I understand correctly then the queue memory where the MSI
> >> descriptor sits is in RAM.
> >
> > Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> > could have enough on-die memory then they could just use really big
> > MSI-X tables. Currently due to on-die memory constraints mlx5 is
> > limited to a few hundred MSI-X vectors.
> 
> Right, that's the limit of a particular device, but nothing prevents you
> to have a larger table on a new device.

Well, physics are a problem.. The SRAM to store the MSI vectors costs
die space and making the chip die larger is not an option. So the
question is what do you throw out of the chip to get a 10-20x increase
in MSI SRAM?

This is why using host memory is so appealing. It is
economically/functionally better.

I'm going to guess other HW is in the same situation, virtualization
is really pushing up the number of required IRQs.

> >> How is that supposed to work if interrupt remapping is disabled?
> >
> > The best we can do is issue a command to the device and spin/sleep
> > until completion. The device will serialize everything internally.
> >
> > If the device has died the driver has code to detect and trigger a
> > PCI function reset which will definitely stop the interrupt.
> 
> If that interrupt is gone into storm mode for some reason then this will
> render your machine unusable before you can do that.

Yes, but in general the HW design is to have one-shot interrupts, it
would have to be well off the rails to storm. The kind of off the
rails where it could also be doing crazy stuff on PCI-E that would be
very harmful.

> > So, the implementation of these functions would be to push any change
> > onto a command queue, trigger the device to DMA the command, spin/sleep
> > until the device returns a response and then continue on. If the
> > device doesn't return a response in a time window then trigger a WQ to
> > do a full device reset.
> 
> I really don't want to do that with the irq descriptor lock held or in
> case of affinity from the interrupt handler as we have to do with PCI
> MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck.
> Also you cannot call into command queue code from interrupt disabled and
> interrupt descriptor lock held sections. You can try, but lockdep will
> yell at you immediately.

Yes, I wouldn't want to do this from an IRQ.

> One question is whether the device can see partial updates to that
> memory due to the async 'swap' of context from the device CPU.

It is worse than just partial updates.. The device operation is much
more like you'd imagine a CPU cache. There could be copies of the RAM
in the device for long periods of time, dirty data in the device that
will flush back to CPU RAM overwriting CPU changes, etc.

Without involving the device there is just no way to create data
consistency, and no way to change the data from the CPU. 

This is the down side of having device data in the RAM. It cannot be
so simple as 'just fetch it every time before you use it' as
performance would be horrible.

> irq chips have already a mechanism in place to deal with stuff which
> cannot be handled from within the irq descriptor spinlock held and
> interrupt disabled section.
> 
> The mechanism was invented to deal with interrupt chips which are
> connected to i2c, spi, etc.. The access to an interrupt chip control
> register has to queue stuff on the bus and wait for completion.
> Obviously not what you can do from interrupt disabled, raw spinlock held
> context either.

Ah intersting, sounds like the right parts! I didn't know about this..

> Now coming back to affinity setting. I'd love to avoid adding the bus
> lock magic to those interfaces because until now they can be called and
> are called from atomic contexts. And obviously none of the devices which
> use the buslock magic support affinity setting because they all deliver
> a single interrupt to a demultiplex interrupt and that one is usually
> sitting at the CPU level where interrupt steering works.
> 
> If we really can get away with atomically updating the message as
> outlined above and just let it happen at some point in the future then
> most problems are solved, except for the nastyness of CPU hotplug.

Since we can't avoid a device command, I'm think more along the lines
of having the affinity update trigger an async WQ to issue the command
from a thread context. Since it doesn't need to be synchronous it can
make it out 'eventually'.

I suppose the core code could provide this as a service? Sort of a
varient of the other lazy things above?

> But that's actually a non issue. Nothing prevents us from having an
> early 'migrate interrupts away from the outgoing CPU hotplug state'
> which runs in thread context and can therefore utilize the buslock
> mechanism. Actually I was thinking about that for other reasons already.

That would certainly work well, seems like it fits with the other
lazy/sleeping stuff above as well.

> >> If interrupt remapping is enabled then both are trivial because then the
> >> irq chip can delegate everything to the parent chip, i.e. the remapping
> >> unit.
> >
> > I did like this notion that IRQ remapping could avoid the overhead of
> > spin/spleep. Most of the use cases we have for this will require the
> > IOMMU anyhow.
> 
> You still need to support !remap scenarios I fear.

For x86 I think we could accept linking this to IOMMU, if really
necessary.

But it would have to work with ARM - is remapping a x86 only thing?
Does ARM put the affinity in the GIC tables not in the MSI data?

> Let me summarize what I think would be the sane solution for this:
> 
>   1) Utilize atomic writes for either all 16 bytes or reorder the bytes
>      and update 8 bytes atomically which is sufficient as the wide
>      address is only used with irq remapping and the MSI message in the
>      device is never changed after startup.

Sadly not something the device can manage due to data coherence

>   2) No requirement for issuing a command for regular migration
>      operations as they have no requirements to be synchronous.
> 
>      Eventually store some state to force a reload on the next regular
>      queue operation.

Would the async version above be OK?

>   3) No requirement for issuing a command for mask and unmask operations.
>      The core code uses and handles lazy masking already. So if the
>      hardware causes the lazyness, so be it.

This lazy masking thing sounds good, I'm totally unfamiliar with it
though.

>   4) Issue commands for startup and teardown as they need to be
>      synchronous

Yep

>   5) Have an early migration state for CPU hotunplug which issues a
>      command from appropriate context. That would even allow to handle
>      queue shutdown for managed interrupts when the last CPU in the
>      managed affinity set goes down. Restart of such a managed interrupt
>      when the first CPU in an affinity set comes online again would only
>      need minor modifications of the existing code to make it work.

Yep

> Thoughts?

This email is super helpful, I definately don't know all these corners
of the IRQ subsystem as my past with it has mostly been SOC stuff that
isn't as complicated!

Thanks,
Jason
Thomas Gleixner Aug. 22, 2020, 1:34 a.m. UTC | #6
Jason,

On Fri, Aug 21 2020 at 21:51, Jason Gunthorpe wrote:
> On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
>> > If the device has died the driver has code to detect and trigger a
>> > PCI function reset which will definitely stop the interrupt.
>> 
>> If that interrupt is gone into storm mode for some reason then this will
>> render your machine unusable before you can do that.
>
> Yes, but in general the HW design is to have one-shot interrupts, it
> would have to be well off the rails to storm. The kind of off the
> rails where it could also be doing crazy stuff on PCI-E that would be
> very harmful.

Yeah. One shot should prevent most of the wreckage. I just wanted to
spell it out.

>> One question is whether the device can see partial updates to that
>> memory due to the async 'swap' of context from the device CPU.
>
> It is worse than just partial updates.. The device operation is much
> more like you'd imagine a CPU cache. There could be copies of the RAM
> in the device for long periods of time, dirty data in the device that
> will flush back to CPU RAM overwriting CPU changes, etc.

TBH, that's insane. You clearly want to think about this some more. If
you swap out device state and device control state then you definitly
want to have regions which are read only from the device POV and never
written back. The MSI msg store clearly belongs into that category.
But that's not restricted to the MSI msg store, there is certainly other
stuff which never wants to be written back by the device.

If you don't do that then you simply can't write to that space from the
CPU and you have to transport this kind information always via command
queues.

But that does not make sense. It's trivial enough to have

    | RO state |
    | RW state |

and on swap in the whole thing is DMA'd into the device and on swap out
only the RW state part. It's not rocket science and makes a huge amount
of sense.

> Without involving the device there is just no way to create data
> consistency, and no way to change the data from the CPU. 
>
> This is the down side of having device data in the RAM. It cannot be
> so simple as 'just fetch it every time before you use it' as
> performance would be horrible.

That's clear, but with a proper seperation like the above and some extra
mechanism which allows you to tickle a relaod of 'RO state' you can
avoid quite some of the problems which you create otherwise.

>> If we really can get away with atomically updating the message as
>> outlined above and just let it happen at some point in the future then
>> most problems are solved, except for the nastyness of CPU hotplug.
>
> Since we can't avoid a device command, I'm think more along the lines
> of having the affinity update trigger an async WQ to issue the command
> from a thread context. Since it doesn't need to be synchronous it can
> make it out 'eventually'.
>
> I suppose the core code could provide this as a service? Sort of a
> varient of the other lazy things above?

Kinda. That needs a lot of thought for the affinity setting stuff
because it can be called from contexts which do not allow that. It's
solvable though, but I clearly need to stare at the corner cases for a
while.

> But it would have to work with ARM - is remapping a x86 only thing?

No. ARM64 has that as well.

> Does ARM put the affinity in the GIC tables not in the MSI data?

IIRC, yes.

>> Let me summarize what I think would be the sane solution for this:
>> 
>>   1) Utilize atomic writes for either all 16 bytes or reorder the bytes
>>      and update 8 bytes atomically which is sufficient as the wide
>>      address is only used with irq remapping and the MSI message in the
>>      device is never changed after startup.
>
> Sadly not something the device can manage due to data coherence

I disagree :)

>>   2) No requirement for issuing a command for regular migration
>>      operations as they have no requirements to be synchronous.
>> 
>>      Eventually store some state to force a reload on the next regular
>>      queue operation.
>
> Would the async version above be OK?

Async is fine in any variant (except for hotplug). Though having an
async WQ or whatever there needs some thought.

>>   3) No requirement for issuing a command for mask and unmask operations.
>>      The core code uses and handles lazy masking already. So if the
>>      hardware causes the lazyness, so be it.
>
> This lazy masking thing sounds good, I'm totally unfamiliar with it
> though.

It's used to avoid irq chip (often MMIO) access in scenarios where
disable/enable of an interrupt line happens with high frequency. Serial
has that issue. So we mark it disabled, but do not mask it and the core
can handle that and masks it once an interrupt comes in in masked
state. That obviously does not work out of the box to protect against
not disabled but masked state, but conceptually it's a similar problem
and can be made work without massive changes. 

OTOH, in normal operation for MSI interrupts (edge type) masking is not
used at all and just restricted to the startup teardown.

But I clearly need to think about it with a more awake brain some more.

> This email is super helpful, I definately don't know all these corners
> of the IRQ subsystem as my past with it has mostly been SOC stuff that
> isn't as complicated!

It's differently complicated and not less horrible :)

Thanks,

        tglx
Jason Gunthorpe Aug. 22, 2020, 11:05 p.m. UTC | #7
On Sat, Aug 22, 2020 at 03:34:45AM +0200, Thomas Gleixner wrote:
> >> One question is whether the device can see partial updates to that
> >> memory due to the async 'swap' of context from the device CPU.
> >
> > It is worse than just partial updates.. The device operation is much
> > more like you'd imagine a CPU cache. There could be copies of the RAM
> > in the device for long periods of time, dirty data in the device that
> > will flush back to CPU RAM overwriting CPU changes, etc.
> 
> TBH, that's insane. You clearly want to think about this some
> more. If

I think this general design is around 15 years old, across a healthy
number of silicon generations, and rather a lager number of shipped
devices. People have thought about it :)

> you swap out device state and device control state then you definitly
> want to have regions which are read only from the device POV and never
> written back. 

It is not as useful as you'd think - the issue with atomicity of
update still largely prevents doing much useful from the CPU, and to
make any CPU side changes visible a device command would still be
needed to synchronize the internal state to that modified memory.

So, CPU centric updates would cover a very limited number of
operations, and a device command is required anyhow. Little is
actually gained.

> The MSI msg store clearly belongs into that category.
> But that's not restricted to the MSI msg store, there is certainly other
> stuff which never wants to be written back by the device.

To get a design where you'd be able to run everything from a CPU
atomic context that can't trigger a WQ..

New silicon would have to implement some MSI-only 'cache' that can
invalidate entries based on a simple MemWr TLP.

Then the affinity update would write to the host memory, then send a
MemWr to the device to trigger invalidate.

As a silicon design it might work, but it means existing devices can't
be used with this dev_msi. It is also the sort of thing that would
need a standard document to have any hope of multiple vendors fitting
into it. Eg at PCI-SIG or something.

> If you don't do that then you simply can't write to that space from the
> CPU and you have to transport this kind information always via command
> queues.

Yes, exactly. This is part of the architectural design of the device,
has been for a long time. Has positives and negatives.

> > I suppose the core code could provide this as a service? Sort of a
> > varient of the other lazy things above?
> 
> Kinda. That needs a lot of thought for the affinity setting stuff
> because it can be called from contexts which do not allow that. It's
> solvable though, but I clearly need to stare at the corner cases for a
> while.

If possible, this would be ideal, as we could use the dev_msi on a big
installed base of existing HW.

I suspect other HW can probably fit into this too as the basic
ingredients should be fairly widespread.

Even a restricted version for situations where affinity does not need
a device update would possibly be interesting (eg x86 IOMMU remap, ARM
GIC, etc)

> OTOH, in normal operation for MSI interrupts (edge type) masking is not
> used at all and just restricted to the startup teardown.

Yeah, at least this device doesn't need masking at runtime, just
startup/teardown and affinity update.

Thanks,
Jason
Thomas Gleixner Aug. 23, 2020, 8:03 a.m. UTC | #8
On Sat, Aug 22 2020 at 20:05, Jason Gunthorpe wrote:
> On Sat, Aug 22, 2020 at 03:34:45AM +0200, Thomas Gleixner wrote:
> As a silicon design it might work, but it means existing devices can't
> be used with this dev_msi. It is also the sort of thing that would
> need a standard document to have any hope of multiple vendors fitting
> into it. Eg at PCI-SIG or something.

Fair enough.

>> If you don't do that then you simply can't write to that space from the
>> CPU and you have to transport this kind information always via command
>> queues.
>
> Yes, exactly. This is part of the architectural design of the device,
> has been for a long time. Has positives and negatives.

As always and it clearly follows the general HW design rule "we can fix
that in software".

>> > I suppose the core code could provide this as a service? Sort of a
>> > varient of the other lazy things above?
>> 
>> Kinda. That needs a lot of thought for the affinity setting stuff
>> because it can be called from contexts which do not allow that. It's
>> solvable though, but I clearly need to stare at the corner cases for a
>> while.
>
> If possible, this would be ideal, as we could use the dev_msi on a big
> installed base of existing HW.

I'll have a look, but I'm surely not going to like the outcome.

Thanks,

        tglx
diff mbox series

Patch

--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -571,4 +571,12 @@  config LOONGSON_PCH_MSI
 	help
 	  Support for the Loongson PCH MSI Controller.
 
+config IMS_MSI
+	bool "IMS Interrupt Message Store MSI controller"
+	depends on PCI
+	select DEVICE_MSI
+	help
+	  Support for IMS Interrupt Message Store MSI controller
+	  with IMS slot storage in a slot array
+
 endmenu
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -111,3 +111,4 @@  obj-$(CONFIG_LOONGSON_HTPIC)		+= irq-loo
 obj-$(CONFIG_LOONGSON_HTVEC)		+= irq-loongson-htvec.o
 obj-$(CONFIG_LOONGSON_PCH_PIC)		+= irq-loongson-pch-pic.o
 obj-$(CONFIG_LOONGSON_PCH_MSI)		+= irq-loongson-pch-msi.o
+obj-$(CONFIG_IMS_MSI)			+= irq-ims-msi.o
--- /dev/null
+++ b/drivers/irqchip/irq-ims-msi.c
@@ -0,0 +1,169 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// (C) Copyright 2020 Thomas Gleixner <tglx@linutronix.de>
+/*
+ * Shared interrupt chip and irq domain for Intel IMS devices
+ */
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/msi.h>
+#include <linux/irq.h>
+
+#include <linux/irqchip/irq-ims-msi.h>
+
+struct ims_data {
+	struct ims_array_info	info;
+	unsigned long		map[0];
+};
+
+static void ims_mask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 __iomem *ctrl = &slot->ctrl;
+
+	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
+}
+
+static void ims_unmask_irq(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
+	u32 __iomem *ctrl = &slot->ctrl;
+
+	iowrite32(ioread32(ctrl) | IMS_VECTOR_CTRL_UNMASK, ctrl);
+}
+
+static void ims_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
+
+	iowrite32(msg->address_lo, &slot->address_lo);
+	iowrite32(msg->address_hi, &slot->address_hi);
+	iowrite32(msg->data, &slot->data);
+}
+
+static const struct irq_chip ims_msi_controller = {
+	.name			= "IMS",
+	.irq_mask		= ims_mask_irq,
+	.irq_unmask		= ims_unmask_irq,
+	.irq_write_msi_msg	= ims_write_msi_msg,
+	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.flags			= IRQCHIP_SKIP_SET_WAKE,
+};
+
+static void ims_reset_slot(struct ims_array_slot __iomem *slot)
+{
+	iowrite32(0, &slot->address_lo);
+	iowrite32(0, &slot->address_hi);
+	iowrite32(0, &slot->data);
+	iowrite32(0, &slot->ctrl);
+}
+
+static void ims_free_msi_store(struct irq_domain *domain, struct device *dev)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_data *ims = info->data;
+	struct msi_desc *entry;
+
+	for_each_msi_entry(entry, dev) {
+		if (entry->device_msi.priv_iomem) {
+			clear_bit(entry->device_msi.hwirq, ims->map);
+			ims_reset_slot(entry->device_msi.priv_iomem);
+			entry->device_msi.priv_iomem = NULL;
+			entry->device_msi.hwirq = 0;
+		}
+	}
+}
+
+static int ims_alloc_msi_store(struct irq_domain *domain, struct device *dev,
+			       int nvec)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_data *ims = info->data;
+	struct msi_desc *entry;
+
+	for_each_msi_entry(entry, dev) {
+		unsigned int idx;
+
+		idx = find_first_zero_bit(ims->map, ims->info.max_slots);
+		if (idx >= ims->info.max_slots)
+			goto fail;
+		set_bit(idx, ims->map);
+		entry->device_msi.priv_iomem = &ims->info.slots[idx];
+		entry->device_msi.hwirq = idx;
+	}
+	return 0;
+
+fail:
+	ims_free_msi_store(domain, dev);
+	return -ENOSPC;
+}
+
+struct ims_domain_template {
+	struct msi_domain_ops	ops;
+	struct msi_domain_info	info;
+};
+
+static const struct ims_domain_template ims_domain_template = {
+	.ops = {
+		.msi_alloc_store	= ims_alloc_msi_store,
+		.msi_free_store		= ims_free_msi_store,
+	},
+	.info = {
+		.flags		= MSI_FLAG_USE_DEF_DOM_OPS |
+				  MSI_FLAG_USE_DEF_CHIP_OPS,
+		.handler	= handle_edge_irq,
+		.handler_name	= "edge",
+	},
+};
+
+struct irq_domain *
+pci_ims_create_msi_irq_domain(struct pci_dev *pdev,
+			      struct ims_array_info *ims_info)
+{
+	struct ims_domain_template *info;
+	struct irq_domain *domain;
+	struct irq_chip *chip;
+	struct ims_data *data;
+	unsigned int size;
+
+	/* Allocate new domain storage */
+	info = kmemdup(&ims_domain_template, sizeof(ims_domain_template),
+		       GFP_KERNEL);
+	if (!info)
+		return NULL;
+	/* Link the ops */
+	info->info.ops = &info->ops;
+
+	/* Allocate ims_info along with the bitmap */
+	size = sizeof(*data);
+	size += BITS_TO_LONGS(ims_info->max_slots) * sizeof(unsigned long);
+	data = kzalloc(size, GFP_KERNEL);
+	if (!data)
+		goto err_info;
+
+	data->info = *ims_info;
+	info->info.data = data;
+
+	chip = kmemdup(&ims_msi_controller, sizeof(ims_msi_controller),
+		       GFP_KERNEL);
+	if (!chip)
+		goto err_data;
+	info->info.chip = chip;
+
+	domain = pci_subdevice_msi_create_irq_domain(pdev, &info->info);
+	if (!domain)
+		goto err_chip;
+
+	return domain;
+
+err_chip:
+	kfree(chip);
+err_data:
+	kfree(data);
+err_info:
+	kfree(info);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_ims_create_msi_irq_domain);
--- /dev/null
+++ b/include/linux/irqchip/irq-ims-msi.h
@@ -0,0 +1,41 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Copyright 2020 Thomas Gleixner <tglx@linutronix.de> */
+
+#ifndef _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+#define _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+
+#include <linux/types.h>
+
+struct ims_array_slot {
+	u32	address_lo;
+	u32	address_hi;
+	u32	data;
+	u32	ctrl;
+};
+
+/* Bit to unmask the interrupt in slot->ctrl */
+#define IMS_VECTOR_CTRL_UNMASK	0x01
+
+struct ims_array_info {
+	struct ims_array_slot	__iomem *slots;
+	unsigned int		max_slots;
+};
+
+/* Dummy forward declaration for illustration */
+struct ims_queue_slot;
+
+/**
+ * ims_msi_store - Interrupt Message Store descriptor data
+ * @array_slot:	Pointer to a on device IMS storage array slot
+ * @queue_slot:	Pointer to storage embedded in queue data
+ * @hw_irq:	Index of the slot or queue. Also hardware irq number
+ */
+struct ims_msi_store {
+	union {
+		struct ims_array_slot __iomem	*array_slot;
+		struct ims_queue_slot		*queue_slot;
+	};
+	unsigned int				hw_irq;
+};
+
+#endif