diff mbox series

[1/4] irqchip/riscv-imsic: Handle non-atomic MSI updates for device

Message ID 20241208150711.297624-2-apatel@ventanamicro.com (mailing list archive)
State New
Headers show
Series Move RISC-V IMSIC driver to the common MSI lib | expand

Commit Message

Anup Patel Dec. 8, 2024, 3:07 p.m. UTC
Device having non-atomic MSI update might see an intermediate
state when changing target IMSIC vector from one CPU to another.

To handle such intermediate device state, update MSI address
and MSI data through separate MSI writes to the device.

Fixes: 027e125acdba ("irqchip/riscv-imsic: Add device MSI domain support for platform devices")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
---
 drivers/irqchip/irq-riscv-imsic-platform.c | 27 ++++++++++++++++++++++
 drivers/irqchip/irq-riscv-imsic-state.c    | 27 +++++++++++++++++++---
 2 files changed, 51 insertions(+), 3 deletions(-)

Comments

Thomas Gleixner Dec. 8, 2024, 8:14 p.m. UTC | #1
On Sun, Dec 08 2024 at 20:37, Anup Patel wrote:
> +
> +			tvec = vec->local_id == mvec->local_id ?
> +			       NULL : &lpriv->vectors[mvec->local_id];
> +			if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {

As I told you before:

I don't see a way how that can work remote with the IMSIC either even if
you can easily access the pending state of the remote CPU:

CPU0                            CPU1                   Device
set_affinity()
  write_msg(tmp)
    write(addr); // CPU1
    write(data); // vector 0x20
							raise IRQ (CPU1, vector 0x20)
				handle vector 0x20
				(other device)

    check_pending(CPU1, 0x20) == false -> Interrupt is lost

There is no guarantee that set_affinity() runs on the original target
CPU (CPU 1). Your scheme only works, when CPU1 vector 0x20 is not used
by some other device. If it's used, you lost as CPU1 will consume the
vector and your pending check is not seeing anything.

x86 ensures CPU locality by deferring the affinity move to the next
device interrupt on the original target CPU (CPU1 in the above
example). See CONFIG_GENERIC_IRQ_PENDING.

The interrupt domains which are not affected (remap) set the
IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
setter code path at all.

Thanks,

        tglx
Anup Patel Dec. 9, 2024, 12:08 p.m. UTC | #2
Hi Thomas,

On Mon, Dec 9, 2024 at 1:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Dec 08 2024 at 20:37, Anup Patel wrote:
> > +
> > +                     tvec = vec->local_id == mvec->local_id ?
> > +                            NULL : &lpriv->vectors[mvec->local_id];
> > +                     if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {
>
> As I told you before:
>
> I don't see a way how that can work remote with the IMSIC either even if
> you can easily access the pending state of the remote CPU:

For RISC-V IMSIC, a remote CPU cannot access the pending state
of interrupts on some other CPU.

>
> CPU0                            CPU1                   Device
> set_affinity()
>   write_msg(tmp)
>     write(addr); // CPU1
>     write(data); // vector 0x20
>                                                         raise IRQ (CPU1, vector 0x20)
>                                 handle vector 0x20
>                                 (other device)
>
>     check_pending(CPU1, 0x20) == false -> Interrupt is lost
>
> There is no guarantee that set_affinity() runs on the original target
> CPU (CPU 1). Your scheme only works, when CPU1 vector 0x20 is not used
> by some other device. If it's used, you lost as CPU1 will consume the
> vector and your pending check is not seeing anything.
>
> x86 ensures CPU locality by deferring the affinity move to the next
> device interrupt on the original target CPU (CPU1 in the above
> example). See CONFIG_GENERIC_IRQ_PENDING.

I agree with you.

The IMSIC driver must do the affinity move upon the next device
interrupt on the old CPU. I will update this patch in the next revision.

BTW, I did not find CONFIG_GENERIC_IRQ_PENDING. Is the
name correct ?

>
> The interrupt domains which are not affected (remap) set the
> IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
> setter code path at all.

Yes, setting the IRQ_MOVE_PCNTXT flag in the remap domain
makes perfect sense.

I suggest adding IRQ_MOVE_PCNTXT usage as part of Drew's
irqbypass series which adds a remap domain in the IOMMU
driver. Unless you insist on having it as part of this series ?

Regards,
Anup
Thomas Gleixner Dec. 9, 2024, 3:53 p.m. UTC | #3
Anup!

On Mon, Dec 09 2024 at 17:38, Anup Patel wrote:
> On Mon, Dec 9, 2024 at 1:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> There is no guarantee that set_affinity() runs on the original target
>> CPU (CPU 1). Your scheme only works, when CPU1 vector 0x20 is not used
>> by some other device. If it's used, you lost as CPU1 will consume the
>> vector and your pending check is not seeing anything.
>>
>> x86 ensures CPU locality by deferring the affinity move to the next
>> device interrupt on the original target CPU (CPU1 in the above
>> example). See CONFIG_GENERIC_IRQ_PENDING.
>
> I agree with you.
>
> The IMSIC driver must do the affinity move upon the next device
> interrupt on the old CPU. I will update this patch in the next revision.
>
> BTW, I did not find CONFIG_GENERIC_IRQ_PENDING. Is the
> name correct ?

CONFIG_GENERIC_PENDING_IRQ is close enough :)

>> The interrupt domains which are not affected (remap) set the
>> IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
>> setter code path at all.
>
> Yes, setting the IRQ_MOVE_PCNTXT flag in the remap domain
> makes perfect sense.
>
> I suggest adding IRQ_MOVE_PCNTXT usage as part of Drew's
> irqbypass series which adds a remap domain in the IOMMU
> driver. Unless you insist on having it as part of this series ?

You need to look at the other RISC-V controllers. Those which do not
need this should set it. That's historically backwards.

I think we can reverse the logic here. As this needs backporting, I
can't make a full cleanup of this, but for your problem the patch below
should just work.

Select GENERIC_PENDING_IRQ and GENERIC_PENDING_IRQ_CHIPFLAGS and set the
IRQCHIP_MOVE_DEFERRED flag on your interrrupt chip and the core logic
takes care of the PCNTXT bits.

I'll convert x86 in a seperate step and remove the PCNTXT leftovers and
the new config knob once the dust has settled.

Thanks,

        tglx
---
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -567,6 +567,7 @@ struct irq_chip {
  *                                    in the suspend path if they are in disabled state
  * IRQCHIP_AFFINITY_PRE_STARTUP:      Default affinity update before startup
  * IRQCHIP_IMMUTABLE:		      Don't ever change anything in this chip
+ * IRQCHIP_MOVE_DEFERRED:	      Move the interrupt in actual interrupt context
  */
 enum {
 	IRQCHIP_SET_TYPE_MASKED			= (1 <<  0),
@@ -581,6 +582,7 @@ enum {
 	IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND	= (1 <<  9),
 	IRQCHIP_AFFINITY_PRE_STARTUP		= (1 << 10),
 	IRQCHIP_IMMUTABLE			= (1 << 11),
+	IRQCHIP_MOVE_DEFERRED			= (1 << 12),
 };
 
 #include <linux/irqdesc.h>
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -31,6 +31,10 @@ config GENERIC_IRQ_EFFECTIVE_AFF_MASK
 config GENERIC_PENDING_IRQ
 	bool
 
+# Deduce delayed migration from top-level interrupt chip flags
+config GENERIC_PENDING_IRQ_CHIPFLAGS
+	bool
+
 # Support for generic irq migrating off cpu before the cpu is offline.
 config GENERIC_IRQ_MIGRATION
 	bool
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -47,6 +47,13 @@ int irq_set_chip(unsigned int irq, const
 		return -EINVAL;
 
 	desc->irq_data.chip = (struct irq_chip *)(chip ?: &no_irq_chip);
+
+	if (IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS) && chip) {
+		if (chip->flags & IRQCHIP_MOVE_DEFERRED)
+			irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
+		else
+			irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
+	}
 	irq_put_desc_unlock(desc, flags);
 	/*
 	 * For !CONFIG_SPARSE_IRQ make the irq show up in
@@ -1114,16 +1121,21 @@ void irq_modify_status(unsigned int irq,
 	trigger = irqd_get_trigger_type(&desc->irq_data);
 
 	irqd_clear(&desc->irq_data, IRQD_NO_BALANCING | IRQD_PER_CPU |
-		   IRQD_TRIGGER_MASK | IRQD_LEVEL | IRQD_MOVE_PCNTXT);
+		   IRQD_TRIGGER_MASK | IRQD_LEVEL);
 	if (irq_settings_has_no_balance_set(desc))
 		irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
 	if (irq_settings_is_per_cpu(desc))
 		irqd_set(&desc->irq_data, IRQD_PER_CPU);
-	if (irq_settings_can_move_pcntxt(desc))
-		irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
 	if (irq_settings_is_level(desc))
 		irqd_set(&desc->irq_data, IRQD_LEVEL);
 
+	/* Keep this around until x86 is converted over */
+	if (!IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS)) {
+		irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
+		if (irq_settings_can_move_pcntxt(desc))
+			irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
+	}
+
 	tmp = irq_settings_get_trigger_mask(desc);
 	if (tmp != IRQ_TYPE_NONE)
 		trigger = tmp;
Anup Patel Dec. 9, 2024, 5:09 p.m. UTC | #4
On Mon, Dec 9, 2024 at 9:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Anup!
>
> On Mon, Dec 09 2024 at 17:38, Anup Patel wrote:
> > On Mon, Dec 9, 2024 at 1:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> There is no guarantee that set_affinity() runs on the original target
> >> CPU (CPU 1). Your scheme only works, when CPU1 vector 0x20 is not used
> >> by some other device. If it's used, you lost as CPU1 will consume the
> >> vector and your pending check is not seeing anything.
> >>
> >> x86 ensures CPU locality by deferring the affinity move to the next
> >> device interrupt on the original target CPU (CPU1 in the above
> >> example). See CONFIG_GENERIC_IRQ_PENDING.
> >
> > I agree with you.
> >
> > The IMSIC driver must do the affinity move upon the next device
> > interrupt on the old CPU. I will update this patch in the next revision.
> >
> > BTW, I did not find CONFIG_GENERIC_IRQ_PENDING. Is the
> > name correct ?
>
> CONFIG_GENERIC_PENDING_IRQ is close enough :)
>
> >> The interrupt domains which are not affected (remap) set the
> >> IRQ_MOVE_PCNTXT flag to avoid that dance and don't use that affinity
> >> setter code path at all.
> >
> > Yes, setting the IRQ_MOVE_PCNTXT flag in the remap domain
> > makes perfect sense.
> >
> > I suggest adding IRQ_MOVE_PCNTXT usage as part of Drew's
> > irqbypass series which adds a remap domain in the IOMMU
> > driver. Unless you insist on having it as part of this series ?
>
> You need to look at the other RISC-V controllers. Those which do not
> need this should set it. That's historically backwards.

I will update the RISC-V APLIC MSI-mode driver in the next revision.
This driver is a good candidate to use IRQ_MOVE_PCNTXT and
IRQCHIP_MOVE_DEFERRED.

>
> I think we can reverse the logic here. As this needs backporting, I
> can't make a full cleanup of this, but for your problem the patch below
> should just work.
>
> Select GENERIC_PENDING_IRQ and GENERIC_PENDING_IRQ_CHIPFLAGS and set the
> IRQCHIP_MOVE_DEFERRED flag on your interrrupt chip and the core logic
> takes care of the PCNTXT bits.

Sure, I will update.

Thanks,
Anup


>
> I'll convert x86 in a seperate step and remove the PCNTXT leftovers and
> the new config knob once the dust has settled.
>
> Thanks,
>
>         tglx
> ---
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -567,6 +567,7 @@ struct irq_chip {
>   *                                    in the suspend path if they are in disabled state
>   * IRQCHIP_AFFINITY_PRE_STARTUP:      Default affinity update before startup
>   * IRQCHIP_IMMUTABLE:                Don't ever change anything in this chip
> + * IRQCHIP_MOVE_DEFERRED:            Move the interrupt in actual interrupt context
>   */
>  enum {
>         IRQCHIP_SET_TYPE_MASKED                 = (1 <<  0),
> @@ -581,6 +582,7 @@ enum {
>         IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND        = (1 <<  9),
>         IRQCHIP_AFFINITY_PRE_STARTUP            = (1 << 10),
>         IRQCHIP_IMMUTABLE                       = (1 << 11),
> +       IRQCHIP_MOVE_DEFERRED                   = (1 << 12),
>  };
>
>  #include <linux/irqdesc.h>
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -31,6 +31,10 @@ config GENERIC_IRQ_EFFECTIVE_AFF_MASK
>  config GENERIC_PENDING_IRQ
>         bool
>
> +# Deduce delayed migration from top-level interrupt chip flags
> +config GENERIC_PENDING_IRQ_CHIPFLAGS
> +       bool
> +
>  # Support for generic irq migrating off cpu before the cpu is offline.
>  config GENERIC_IRQ_MIGRATION
>         bool
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -47,6 +47,13 @@ int irq_set_chip(unsigned int irq, const
>                 return -EINVAL;
>
>         desc->irq_data.chip = (struct irq_chip *)(chip ?: &no_irq_chip);
> +
> +       if (IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS) && chip) {
> +               if (chip->flags & IRQCHIP_MOVE_DEFERRED)
> +                       irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +               else
> +                       irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +       }
>         irq_put_desc_unlock(desc, flags);
>         /*
>          * For !CONFIG_SPARSE_IRQ make the irq show up in
> @@ -1114,16 +1121,21 @@ void irq_modify_status(unsigned int irq,
>         trigger = irqd_get_trigger_type(&desc->irq_data);
>
>         irqd_clear(&desc->irq_data, IRQD_NO_BALANCING | IRQD_PER_CPU |
> -                  IRQD_TRIGGER_MASK | IRQD_LEVEL | IRQD_MOVE_PCNTXT);
> +                  IRQD_TRIGGER_MASK | IRQD_LEVEL);
>         if (irq_settings_has_no_balance_set(desc))
>                 irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
>         if (irq_settings_is_per_cpu(desc))
>                 irqd_set(&desc->irq_data, IRQD_PER_CPU);
> -       if (irq_settings_can_move_pcntxt(desc))
> -               irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
>         if (irq_settings_is_level(desc))
>                 irqd_set(&desc->irq_data, IRQD_LEVEL);
>
> +       /* Keep this around until x86 is converted over */
> +       if (!IS_ENABLED(CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS)) {
> +               irqd_clear(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +               if (irq_settings_can_move_pcntxt(desc))
> +                       irqd_set(&desc->irq_data, IRQD_MOVE_PCNTXT);
> +       }
> +
>         tmp = irq_settings_get_trigger_mask(desc);
>         if (tmp != IRQ_TYPE_NONE)
>                 trigger = tmp;
>
>
>
diff mbox series

Patch

diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
index c708780e8760..707c7ccb4d08 100644
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -97,6 +97,7 @@  static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
 {
 	struct imsic_vector *old_vec, *new_vec;
 	struct irq_data *pd = d->parent_data;
+	struct imsic_vector tmp_vec;
 
 	old_vec = irq_data_get_irq_chip_data(pd);
 	if (WARN_ON(!old_vec))
@@ -110,11 +111,37 @@  static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
 	if (imsic_vector_get_move(old_vec))
 		return -EBUSY;
 
+	/*
+	 * Device having non-atomic MSI update might see an intermediate
+	 * state when changing target IMSIC vector from one CPU to another.
+	 *
+	 * To avoid losing interrupt to some intermediate state, do the
+	 * following (just like x86 APIC):
+	 *
+	 * 1) First write a temporary IMSIC vector to the device which
+	 * has MSI address same as the old IMSIC vector but MSI data
+	 * matches the new IMSIC vector.
+	 *
+	 * 2) Next write the new IMSIC vector to the device.
+	 *
+	 * Based on the above, the __imsic_local_sync() must check both
+	 * old MSI data and new MSI data on the old CPU for pending
+	 */
+
 	/* Get a new vector on the desired set of CPUs */
 	new_vec = imsic_vector_alloc(old_vec->hwirq, mask_val);
 	if (!new_vec)
 		return -ENOSPC;
 
+	if (new_vec->local_id != old_vec->local_id) {
+		/* Setup temporary vector */
+		tmp_vec.cpu = old_vec->cpu;
+		tmp_vec.local_id = new_vec->local_id;
+
+		/* Point device to the temporary vector */
+		imsic_msi_update_msg(d, &tmp_vec);
+	}
+
 	/* Point device to the new vector */
 	imsic_msi_update_msg(d, new_vec);
 
diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
index b97e6cd89ed7..230b917136e6 100644
--- a/drivers/irqchip/irq-riscv-imsic-state.c
+++ b/drivers/irqchip/irq-riscv-imsic-state.c
@@ -126,8 +126,8 @@  void __imsic_eix_update(unsigned long base_id, unsigned long num_id, bool pend,
 
 static void __imsic_local_sync(struct imsic_local_priv *lpriv)
 {
-	struct imsic_local_config *mlocal;
-	struct imsic_vector *vec, *mvec;
+	struct imsic_local_config *tlocal, *mlocal;
+	struct imsic_vector *vec, *tvec, *mvec;
 	int i;
 
 	lockdep_assert_held(&lpriv->lock);
@@ -151,7 +151,28 @@  static void __imsic_local_sync(struct imsic_local_priv *lpriv)
 		mvec = READ_ONCE(vec->move);
 		WRITE_ONCE(vec->move, NULL);
 		if (mvec && mvec != vec) {
-			if (__imsic_id_read_clear_pending(i)) {
+			/*
+			 * Device having non-atomic MSI update might see an
+			 * intermediate state so check both old ID and new ID
+			 * for pending interrupts.
+			 *
+			 * For details, refer imsic_irq_set_affinity().
+			 */
+
+			tvec = vec->local_id == mvec->local_id ?
+			       NULL : &lpriv->vectors[mvec->local_id];
+			if (tvec && __imsic_id_read_clear_pending(tvec->local_id)) {
+				/* Retrigger temporary vector if it was already in-use */
+				if (READ_ONCE(tvec->enable)) {
+					tlocal = per_cpu_ptr(imsic->global.local, tvec->cpu);
+					writel_relaxed(tvec->local_id, tlocal->msi_va);
+				}
+
+				mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
+				writel_relaxed(mvec->local_id, mlocal->msi_va);
+			}
+
+			if (__imsic_id_read_clear_pending(vec->local_id)) {
 				mlocal = per_cpu_ptr(imsic->global.local, mvec->cpu);
 				writel_relaxed(mvec->local_id, mlocal->msi_va);
 			}