Message ID: 20241208150711.297624-2-apatel@ventanamicro.com (mailing list archive)
State:      New
Series:     Move RISC-V IMSIC driver to the common MSI lib
On Sun, Dec 08 2024 at 20:37, Anup Patel wrote:
> +
> +			tvec = vec->local_id == mvec->local_id ?
> +			       NULL : &lpriv->vectors[mvec->local_id];
> +			if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {

As I told you before:

I don't see how that can work remotely with the IMSIC either, even if
you can easily access the pending state of the remote CPU:

    CPU0                          CPU1                  Device
    set_affinity()
      write_msg(tmp)
        write(addr);  // CPU1
        write(data);  // vector 0x20
                                                        raise IRQ (CPU1, vector 0x20)
                                  handle vector 0x20
                                  (other device)
      check_pending(CPU1, 0x20) == false

        -> Interrupt is lost

There is no guarantee that set_affinity() runs on the original target
CPU (CPU1). Your scheme only works when CPU1 vector 0x20 is not used by
some other device. If it is used, you lose: CPU1 will consume the vector
and your pending check will not see anything.

x86 ensures CPU locality by deferring the affinity move to the next
device interrupt on the original target CPU (CPU1 in the above example).
See CONFIG_GENERIC_IRQ_PENDING.

The interrupt domains which are not affected (remap) set the
IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
setter code path at all.

Thanks,

        tglx
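A minimal sketch of the deferred-move scheme described above, assuming
CONFIG_GENERIC_PENDING_IRQ; the chip callback name is hypothetical and
not taken from the thread, while irq_move_irq() is the existing core
helper that performs a pending affinity change and is a no-op otherwise:

#include <linux/irq.h>

/* Hypothetical ack callback of an irq_chip that defers affinity moves. */
static void example_chip_irq_ack(struct irq_data *d)
{
	/*
	 * This runs in the interrupt context of the old target CPU, so
	 * re-targeting the vector here cannot race with a message the
	 * old CPU has already latched. Without a pending move this is
	 * a no-op.
	 */
	irq_move_irq(d);

	/* Hardware-specific acknowledge would follow here. */
}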
Hi Thomas,

On Mon, Dec 9, 2024 at 1:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Dec 08 2024 at 20:37, Anup Patel wrote:
> > +
> > +			tvec = vec->local_id == mvec->local_id ?
> > +			       NULL : &lpriv->vectors[mvec->local_id];
> > +			if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {
>
> As I told you before:
>
> I don't see how that can work remotely with the IMSIC either, even if
> you can easily access the pending state of the remote CPU:

For RISC-V IMSIC, a remote CPU cannot access the pending state of
interrupts on some other CPU.

>
>     CPU0                          CPU1                  Device
>     set_affinity()
>       write_msg(tmp)
>         write(addr);  // CPU1
>         write(data);  // vector 0x20
>                                                         raise IRQ (CPU1, vector 0x20)
>                                   handle vector 0x20
>                                   (other device)
>       check_pending(CPU1, 0x20) == false
>
>         -> Interrupt is lost
>
> There is no guarantee that set_affinity() runs on the original target
> CPU (CPU1). Your scheme only works when CPU1 vector 0x20 is not used by
> some other device. If it is used, you lose: CPU1 will consume the vector
> and your pending check will not see anything.
>
> x86 ensures CPU locality by deferring the affinity move to the next
> device interrupt on the original target CPU (CPU1 in the above example).
> See CONFIG_GENERIC_IRQ_PENDING.

I agree with you.

The IMSIC driver must do the affinity move upon the next device
interrupt on the old CPU. I will update this patch in the next revision.

BTW, I did not find CONFIG_GENERIC_IRQ_PENDING. Is the name correct?

>
> The interrupt domains which are not affected (remap) set the
> IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
> setter code path at all.

Yes, setting the IRQ_MOVE_PCNTXT flag in the remap domain makes
perfect sense.

I suggest adding IRQ_MOVE_PCNTXT usage as part of Drew's irqbypass
series, which adds a remap domain in the IOMMU driver. Unless you
insist on having it as part of this series?

Regards,
Anup
Anup!

On Mon, Dec 09 2024 at 17:38, Anup Patel wrote:
> On Mon, Dec 9, 2024 at 1:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> There is no guarantee that set_affinity() runs on the original target
>> CPU (CPU1). Your scheme only works when CPU1 vector 0x20 is not used by
>> some other device. If it is used, you lose: CPU1 will consume the vector
>> and your pending check will not see anything.
>>
>> x86 ensures CPU locality by deferring the affinity move to the next
>> device interrupt on the original target CPU (CPU1 in the above example).
>> See CONFIG_GENERIC_IRQ_PENDING.
>
> I agree with you.
>
> The IMSIC driver must do the affinity move upon the next device
> interrupt on the old CPU. I will update this patch in the next revision.
>
> BTW, I did not find CONFIG_GENERIC_IRQ_PENDING. Is the name correct?

CONFIG_GENERIC_PENDING_IRQ is close enough :)

>> The interrupt domains which are not affected (remap) set the
>> IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
>> setter code path at all.
>
> Yes, setting the IRQ_MOVE_PCNTXT flag in the remap domain makes
> perfect sense.
>
> I suggest adding IRQ_MOVE_PCNTXT usage as part of Drew's irqbypass
> series, which adds a remap domain in the IOMMU driver. Unless you
> insist on having it as part of this series?

You need to look at the other RISC-V controllers. Those which do not
need this should set it. That's historically backwards.

I think we can reverse the logic here. As this needs backporting, I
can't make a full cleanup of this, but for your problem the patch below
should just work.

Select GENERIC_PENDING_IRQ and GENERIC_PENDING_IRQ_CHIPFLAGS and set the
IRQCHIP_MOVE_DEFERRED flag on your interrupt chip, and the core logic
takes care of the PCNTXT bits.

I'll convert x86 in a separate step and remove the PCNTXT leftovers and
the new config knob once the dust has settled.

Thanks,

        tglx
---
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -567,6 +567,7 @@ struct irq_chip {
  *				in the suspend path if they are in disabled state
  * IRQCHIP_AFFINITY_PRE_STARTUP:	Default affinity update before startup
  * IRQCHIP_IMMUTABLE:		Don't ever change anything in this chip
+ * IRQCHIP_MOVE_DEFERRED:	Move the interrupt in actual interrupt context
  */
 enum {
 	IRQCHIP_SET_TYPE_MASKED			= (1 << 0),
@@ -581,6 +582,7 @@ enum {
 	IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND	= (1 << 9),
 	IRQCHIP_AFFINITY_PRE_STARTUP		= (1 << 10),
 	IRQCHIP_IMMUTABLE			= (1 << 11),
+	IRQCHIP_MOVE_DEFERRED			= (1 << 12),
 };
 
 #include <linux/irqdesc.h>
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -31,6 +31,10 @@ config GENERIC_IRQ_EFFECTIVE_AFF_MASK
 config GENERIC_PENDING_IRQ
 	bool
 
+# Deduce delayed migration from top-level interrupt chip flags
+config GENERIC_PENDING_IRQ_CHIPFLAGS
+	bool
+
 # Support for generic irq migrating off cpu before the cpu is offline.
 config GENERIC_IRQ_MIGRATION
 	bool
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -47,6 +47,13 @@ int irq_set_chip(unsigned int irq, const
 		return -EINVAL;
 
 	desc->irq_data.chip = (struct irq_chip *)(chip ?: &no_irq_chip);
+
+	if (IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS) && chip) {
+		if (chip->flags & IRQCHIP_MOVE_DEFERRED)
+			irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
+		else
+			irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
+	}
 	irq_put_desc_unlock(desc, flags);
 	/*
 	 * For !CONFIG_SPARSE_IRQ make the irq show up in
@@ -1114,16 +1121,21 @@ void irq_modify_status(unsigned int irq,
 	trigger = irqd_get_trigger_type(&desc->irq_data);
 
 	irqd_clear(&desc->irq_data, IRQD_NO_BALANCING | IRQD_PER_CPU |
-		   IRQD_TRIGGER_MASK | IRQD_LEVEL | IRQD_MOVE_PCNTXT);
+		   IRQD_TRIGGER_MASK | IRQD_LEVEL);
 	if (irq_settings_has_no_balance_set(desc))
 		irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
 	if (irq_settings_is_per_cpu(desc))
 		irqd_set(&desc->irq_data, IRQD_PER_CPU);
-	if (irq_settings_can_move_pcntxt(desc))
-		irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
 	if (irq_settings_is_level(desc))
 		irqd_set(&desc->irq_data, IRQD_LEVEL);
 
+	/* Keep this around until x86 is converted over */
+	if (!IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS)) {
+		irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
+		if (irq_settings_can_move_pcntxt(desc))
+			irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
+	}
+
 	tmp = irq_settings_get_trigger_mask(desc);
 	if (tmp != IRQ_TYPE_NONE)
 		trigger = tmp;
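A minimal sketch of the driver-side opt-in Thomas describes, assuming
the core patch above is applied; the chip name and callbacks are
placeholders, and the driver's Kconfig entry would additionally select
GENERIC_PENDING_IRQ and GENERIC_PENDING_IRQ_CHIPFLAGS:

#include <linux/irq.h>

/*
 * Placeholder MSI irq_chip: setting IRQCHIP_MOVE_DEFERRED makes the
 * core clear IRQD_MOVE_PCNTXT in irq_set_chip(), so affinity changes
 * are deferred to the next interrupt on the old target CPU.
 */
static struct irq_chip example_msi_irq_chip = {
	.name		= "EXAMPLE-MSI",
	.irq_mask	= irq_chip_mask_parent,
	.irq_unmask	= irq_chip_unmask_parent,
	.irq_ack	= irq_chip_ack_parent,
	.flags		= IRQCHIP_MOVE_DEFERRED,
};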
On Mon, Dec 9, 2024 at 9:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Anup!
>
> On Mon, Dec 09 2024 at 17:38, Anup Patel wrote:
> > On Mon, Dec 9, 2024 at 1:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> There is no guarantee that set_affinity() runs on the original target
> >> CPU (CPU1). Your scheme only works when CPU1 vector 0x20 is not used by
> >> some other device. If it is used, you lose: CPU1 will consume the vector
> >> and your pending check will not see anything.
> >>
> >> x86 ensures CPU locality by deferring the affinity move to the next
> >> device interrupt on the original target CPU (CPU1 in the above example).
> >> See CONFIG_GENERIC_IRQ_PENDING.
> >
> > I agree with you.
> >
> > The IMSIC driver must do the affinity move upon the next device
> > interrupt on the old CPU. I will update this patch in the next revision.
> >
> > BTW, I did not find CONFIG_GENERIC_IRQ_PENDING. Is the name correct?
>
> CONFIG_GENERIC_PENDING_IRQ is close enough :)
>
> >> The interrupt domains which are not affected (remap) set the
> >> IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
> >> setter code path at all.
> >
> > Yes, setting the IRQ_MOVE_PCNTXT flag in the remap domain makes
> > perfect sense.
> >
> > I suggest adding IRQ_MOVE_PCNTXT usage as part of Drew's irqbypass
> > series, which adds a remap domain in the IOMMU driver. Unless you
> > insist on having it as part of this series?
>
> You need to look at the other RISC-V controllers. Those which do not
> need this should set it. That's historically backwards.

I will update the RISC-V APLIC MSI-mode driver in the next revision.
This driver is a good candidate to use IRQ_MOVE_PCNTXT and
IRQCHIP_MOVE_DEFERRED.

>
> I think we can reverse the logic here. As this needs backporting, I
> can't make a full cleanup of this, but for your problem the patch below
> should just work.
>
> Select GENERIC_PENDING_IRQ and GENERIC_PENDING_IRQ_CHIPFLAGS and set the
> IRQCHIP_MOVE_DEFERRED flag on your interrupt chip, and the core logic
> takes care of the PCNTXT bits.

Sure, I will update.

Thanks,
Anup

>
> I'll convert x86 in a separate step and remove the PCNTXT leftovers and
> the new config knob once the dust has settled.
>
> Thanks,
>
>         tglx
> ---
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -567,6 +567,7 @@ struct irq_chip {
>   *				in the suspend path if they are in disabled state
>   * IRQCHIP_AFFINITY_PRE_STARTUP:	Default affinity update before startup
>   * IRQCHIP_IMMUTABLE:		Don't ever change anything in this chip
> + * IRQCHIP_MOVE_DEFERRED:	Move the interrupt in actual interrupt context
>   */
>  enum {
>  	IRQCHIP_SET_TYPE_MASKED			= (1 << 0),
> @@ -581,6 +582,7 @@ enum {
>  	IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND	= (1 << 9),
>  	IRQCHIP_AFFINITY_PRE_STARTUP		= (1 << 10),
>  	IRQCHIP_IMMUTABLE			= (1 << 11),
> +	IRQCHIP_MOVE_DEFERRED			= (1 << 12),
>  };
>
>  #include <linux/irqdesc.h>
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -31,6 +31,10 @@ config GENERIC_IRQ_EFFECTIVE_AFF_MASK
>  config GENERIC_PENDING_IRQ
>  	bool
>
> +# Deduce delayed migration from top-level interrupt chip flags
> +config GENERIC_PENDING_IRQ_CHIPFLAGS
> +	bool
> +
>  # Support for generic irq migrating off cpu before the cpu is offline.
>  config GENERIC_IRQ_MIGRATION
>  	bool
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -47,6 +47,13 @@ int irq_set_chip(unsigned int irq, const
>  		return -EINVAL;
>
>  	desc->irq_data.chip = (struct irq_chip *)(chip ?: &no_irq_chip);
> +
> +	if (IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS) && chip) {
> +		if (chip->flags & IRQCHIP_MOVE_DEFERRED)
> +			irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +		else
> +			irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +	}
>  	irq_put_desc_unlock(desc, flags);
>  	/*
>  	 * For !CONFIG_SPARSE_IRQ make the irq show up in
> @@ -1114,16 +1121,21 @@ void irq_modify_status(unsigned int irq,
>  	trigger = irqd_get_trigger_type(&desc->irq_data);
>
>  	irqd_clear(&desc->irq_data, IRQD_NO_BALANCING | IRQD_PER_CPU |
> -		   IRQD_TRIGGER_MASK | IRQD_LEVEL | IRQD_MOVE_PCNTXT);
> +		   IRQD_TRIGGER_MASK | IRQD_LEVEL);
>  	if (irq_settings_has_no_balance_set(desc))
>  		irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
>  	if (irq_settings_is_per_cpu(desc))
>  		irqd_set(&desc->irq_data, IRQD_PER_CPU);
> -	if (irq_settings_can_move_pcntxt(desc))
> -		irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
>  	if (irq_settings_is_level(desc))
>  		irqd_set(&desc->irq_data, IRQD_LEVEL);
>
> +	/* Keep this around until x86 is converted over */
> +	if (!IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS)) {
> +		irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +		if (irq_settings_can_move_pcntxt(desc))
> +			irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +	}
> +
>  	tmp = irq_settings_get_trigger_mask(desc);
>  	if (tmp != IRQ_TYPE_NONE)
>  		trigger = tmp;
diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
index c708780e8760..707c7ccb4d08 100644
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -97,6 +97,7 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
 {
 	struct imsic_vector *old_vec, *new_vec;
 	struct irq_data *pd = d->parent_data;
+	struct imsic_vector tmp_vec;
 
 	old_vec = irq_data_get_irq_chip_data(pd);
 	if (WARN_ON(!old_vec))
@@ -110,11 +111,37 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
 	if (imsic_vector_get_move(old_vec))
 		return -EBUSY;
 
+	/*
+	 * Device having non-atomic MSI update might see an intermediate
+	 * state when changing target IMSIC vector from one CPU to another.
+	 *
+	 * To avoid losing interrupt to some intermediate state, do the
+	 * following (just like x86 APIC):
+	 *
+	 * 1) First write a temporary IMSIC vector to the device which
+	 *    has MSI address same as the old IMSIC vector but MSI data
+	 *    matches the new IMSIC vector.
+	 *
+	 * 2) Next write the new IMSIC vector to the device.
+	 *
+	 * Based on the above, the __imsic_local_sync() must check both
+	 * old MSI data and new MSI data on the old CPU for pending
+	 */
+
 	/* Get a new vector on the desired set of CPUs */
 	new_vec = imsic_vector_alloc(old_vec->hwirq, mask_val);
 	if (!new_vec)
 		return -ENOSPC;
 
+	if (new_vec->local_id != old_vec->local_id) {
+		/* Setup temporary vector */
+		tmp_vec.cpu = old_vec->cpu;
+		tmp_vec.local_id = new_vec->local_id;
+
+		/* Point device to the temporary vector */
+		imsic_msi_update_msg(d, &tmp_vec);
+	}
+
 	/* Point device to the new vector */
 	imsic_msi_update_msg(d, new_vec);
 
diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
index b97e6cd89ed7..230b917136e6 100644
--- a/drivers/irqchip/irq-riscv-imsic-state.c
+++ b/drivers/irqchip/irq-riscv-imsic-state.c
@@ -126,8 +126,8 @@ void __imsic_eix_update(unsigned long base_id, unsigned long num_id, bool pend,
 
 static void __imsic_local_sync(struct imsic_local_priv *lpriv)
 {
-	struct imsic_local_config *mlocal;
-	struct imsic_vector *vec, *mvec;
+	struct imsic_local_config *tlocal, *mlocal;
+	struct imsic_vector *vec, *tvec, *mvec;
 	int i;
 
 	lockdep_assert_held(&lpriv->lock);
@@ -151,7 +151,28 @@ static void __imsic_local_sync(struct imsic_local_priv *lpriv)
 		mvec = READ_ONCE(vec->move);
 		WRITE_ONCE(vec->move, NULL);
 		if (mvec && mvec != vec) {
-			if (__imsic_id_read_clear_pending(i)) {
+			/*
+			 * Device having non-atomic MSI update might see an
+			 * intermediate state so check both old ID and new ID
+			 * for pending interrupts.
+			 *
+			 * For details, refer imsic_irq_set_affinity().
+			 */
+
+			tvec = vec->local_id == mvec->local_id ?
+			       NULL : &lpriv->vectors[mvec->local_id];
+			if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {
+				/* Retrigger temporary vector if it was already in-use */
+				if (READ_ONCE(tvec->enable)) {
+					tlocal = per_cpu_ptr(imsic->global.local, tvec->cpu);
+					writel_relaxed(tvec->local_id, tlocal->msi_va);
+				}
+
+				mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
+				writel_relaxed(mvec->local_id, mlocal->msi_va);
+			}
+
+			if (__imsic_id_read_clear_pending(vec->local_id)) {
 				mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
 				writel_relaxed(mvec->local_id, mlocal->msi_va);
 			}
Device having non-atomic MSI update might see an intermediate state
when changing target IMSIC vector from one CPU to another. To handle
such intermediate device state, update MSI address and MSI data through
separate MSI writes to the device.

Fixes: 027e125acdba ("irqchip/riscv-imsic: Add device MSI domain support for platform devices")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
---
 drivers/irqchip/irq-riscv-imsic-platform.c | 27 ++++++++++++++++++++++
 drivers/irqchip/irq-riscv-imsic-state.c    | 27 +++++++++++++++++++---
 2 files changed, 51 insertions(+), 3 deletions(-)
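A stand-alone illustration (not part of the patch) of why the two
separate MSI writes cannot lose an interrupt: every intermediate MSI
address/data pair a device with a non-atomic update can latch is either
the final vector or a vector on the old CPU that __imsic_local_sync()
now checks; the CPU and vector numbers below are made up:

#include <stdio.h>

/* Simulated MSI register state of a device with non-atomic updates. */
struct msi_state { unsigned int target_cpu, vector_id; };

int main(void)
{
	struct msi_state dev = { .target_cpu = 0, .vector_id = 10 }; /* old vector */
	const unsigned int new_cpu = 1, new_id = 22;                 /* new vector */

	/* Step 1: temporary vector - old MSI address, new MSI data. */
	dev.vector_id = new_id;
	printf("after step 1: CPU%u / ID%u (old CPU, checked for pending)\n",
	       dev.target_cpu, dev.vector_id);

	/* Step 2: final vector - new MSI address, MSI data unchanged. */
	dev.target_cpu = new_cpu;
	printf("after step 2: CPU%u / ID%u (final target)\n",
	       dev.target_cpu, dev.vector_id);

	return 0;
}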