diff mbox series

[RFC,06/12] genirq: Add per-cpu flow handler with conditional IRQ stats

Message ID 20240604050940.859909-7-mhklinux@outlook.com (mailing list archive)
State Not Applicable
Headers show
Series Hyper-V guests use Linux IRQs for channel interrupts | expand

Commit Message

Michael Kelley June 4, 2024, 5:09 a.m. UTC
From: Michael Kelley <mhklinux@outlook.com>

Hyper-V VMBus devices generate interrupts that are multiplexed
onto a single per-CPU architectural interrupt. The top-level VMBus
driver ISR demultiplexes these interrupts and invokes per-device
handlers. Currently, these per-device handlers are not modeled as
Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
to the top level architectural interrupt. Visibility into per-device
interrupt stats requires accessing VMBus-specific entries in sysfs.
The top-level VMBus driver ISR also handles management-related
interrupts that are not attributable to a particular VMBus device.

As part of changing VMBus to model VMBus per-device handlers as
normal Linux IRQs, the top-level VMBus driver needs to conditionally
account for interrupts. If it passes the interrupt off to a
device-specific IRQ, the interrupt stats are done by that IRQ
handler, and accounting for the interrupt at the top level
is duplicative. But if it handles a management-related interrupt
itself, then it should account for the interrupt itself.

Introduce a new flow handler that provides this functionality.
The new handler parallels handle_percpu_irq(), but does stats
only if the ISR returns other than IRQ_NONE. The existing
handle_untracked_irq() can't be used because it doesn't work for
per-cpu IRQs, and it doesn't provide conditional stats.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 include/linux/irq.h |  1 +
 kernel/irq/chip.c   | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

Comments

Thomas Gleixner June 4, 2024, 6:13 p.m. UTC | #1
Michael!

On Mon, Jun 03 2024 at 22:09, mhkelley58@gmail.com wrote:
> Hyper-V VMBus devices generate interrupts that are multiplexed
> onto a single per-CPU architectural interrupt. The top-level VMBus
> driver ISR demultiplexes these interrupts and invokes per-device
> handlers. Currently, these per-device handlers are not modeled as
> Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
> to the top level architectural interrupt. Visibility into per-device
> interrupt stats requires accessing VMBus-specific entries in sysfs.
> The top-level VMBus driver ISR also handles management-related
> interrupts that are not attributable to a particular VMBus device.
>
> As part of changing VMBus to model VMBus per-device handlers as
> normal Linux IRQs, the top-level VMBus driver needs to conditionally
> account for interrupts. If it passes the interrupt off to a
> device-specific IRQ, the interrupt stats are done by that IRQ
> handler, and accounting for the interrupt at the top level
> is duplicative. But if it handles a management-related interrupt
> itself, then it should account for the interrupt itself.
>
> Introduce a new flow handler that provides this functionality.
> The new handler parallels handle_percpu_irq(), but does stats
> only if the ISR returns other than IRQ_NONE. The existing
> handle_untracked_irq() can't be used because it doesn't work for
> per-cpu IRQs, and it doesn't provide conditional stats.

There is a two other options to solve this:

   1) Move the inner workings of handle_percpu_irq() out into
      a static function which returns the 'handled' value and
      share it between the two handler functions.

   2) Allocate a proper interrupt for the management mode and invoke it
      via generic_handle_irq() just as any other demultiplex interrupt.
      That spares all the special casing in the core code and just
      works.

Thanks,

        tglx
Michael Kelley June 4, 2024, 11:03 p.m. UTC | #2
From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM
> 
> Michael!
> 
> On Mon, Jun 03 2024 at 22:09, mhkelley58@gmail.com wrote:
> > Hyper-V VMBus devices generate interrupts that are multiplexed
> > onto a single per-CPU architectural interrupt. The top-level VMBus
> > driver ISR demultiplexes these interrupts and invokes per-device
> > handlers. Currently, these per-device handlers are not modeled as
> > Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
> > to the top level architectural interrupt. Visibility into per-device
> > interrupt stats requires accessing VMBus-specific entries in sysfs.
> > The top-level VMBus driver ISR also handles management-related
> > interrupts that are not attributable to a particular VMBus device.
> >
> > As part of changing VMBus to model VMBus per-device handlers as
> > normal Linux IRQs, the top-level VMBus driver needs to conditionally
> > account for interrupts. If it passes the interrupt off to a
> > device-specific IRQ, the interrupt stats are done by that IRQ
> > handler, and accounting for the interrupt at the top level
> > is duplicative. But if it handles a management-related interrupt
> > itself, then it should account for the interrupt itself.
> >
> > Introduce a new flow handler that provides this functionality.
> > The new handler parallels handle_percpu_irq(), but does stats
> > only if the ISR returns other than IRQ_NONE. The existing
> > handle_untracked_irq() can't be used because it doesn't work for
> > per-cpu IRQs, and it doesn't provide conditional stats.
> 
> There is a two other options to solve this:
> 

Thanks for taking a look.  Unfortunately, unless I'm missing something,
both options you suggest have downsides.

>    1) Move the inner workings of handle_percpu_irq() out into
>       a static function which returns the 'handled' value and
>       share it between the two handler functions.

The "inner workings" aren't quite the same in the two cases.
handle_percpu_irq() uses handle_irq_event_percpu() while
handle_percpu_demux_irq() uses __handle_irq_event_percpu().
The latter doesn't do add_interrupt_randomness() because the
demultiplexed IRQ handler will do it.  Doing add_interrupt_randomness()
twice doesn't break anything, but it's more overhead in the hard irq
path, which I'm trying to avoid.  The extra functionality in the
non-double-underscore version could be hoisted up to
handle_percpu_irq(), but that offsets gains from sharing the
inner workings.

> 
>    2) Allocate a proper interrupt for the management mode and invoke it
>       via generic_handle_irq() just as any other demultiplex interrupt.
>       That spares all the special casing in the core code and just
>       works.

Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
and the interrupt counting is done in Hyper-V specific code that could be
removed.  The demux'ed interrupt does the counting.

But on arm64 the top-level interrupt *is* a Linux IRQ, so each
interrupt will get double-counted, which is a problem.  Having to add
handle_percpu_demux_irq() to handle arm64 correctly isn't as clean
as I wish it could be.  But I couldn't find a better approach.

Michael
Thomas Gleixner June 5, 2024, 1:20 p.m. UTC | #3
On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM
>>    1) Move the inner workings of handle_percpu_irq() out into
>>       a static function which returns the 'handled' value and
>>       share it between the two handler functions.
>
> The "inner workings" aren't quite the same in the two cases.
> handle_percpu_irq() uses handle_irq_event_percpu() while
> handle_percpu_demux_irq() uses __handle_irq_event_percpu().
> The latter doesn't do add_interrupt_randomness() because the
> demultiplexed IRQ handler will do it.  Doing add_interrupt_randomness()
> twice doesn't break anything, but it's more overhead in the hard irq
> path, which I'm trying to avoid.  The extra functionality in the
> non-double-underscore version could be hoisted up to
> handle_percpu_irq(), but that offsets gains from sharing the
> inner workings.

That's not rocket science to solve:

static irqreturn_t helper(desc, func)
{
	boiler_plate..
        ret = func(desc)
	boiler_plate..
        return ret;
}

No?

TBH, I still hate that conditional accounting :)

>>    2) Allocate a proper interrupt for the management mode and invoke it
>>       via generic_handle_irq() just as any other demultiplex interrupt.
>>       That spares all the special casing in the core code and just
>>       works.
>
> Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
> and the interrupt counting is done in Hyper-V specific code that could be
> removed.  The demux'ed interrupt does the counting.
>
> But on arm64 the top-level interrupt *is* a Linux IRQ, so each
> interrupt will get double-counted, which is a problem.

What is the problem?

You have: toplevel, mgmt, device[], right?

They are all accounted for seperately and each toplevel interrupt might
result in demultiplexing one or more interrupts (mgmt, device[]), no?

IMO accounting the toplevel interrupt seperately is informative because
it allows you to figure out whether demultiplexing is clustered or not,
but I lost that argument long ago. That's why most demultiplex muck
installs a chained handler, which is a design fail on it's own.

Thanks,

        tglx
Michael Kelley June 5, 2024, 1:45 p.m. UTC | #4
From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> 
> On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM
> >>    1) Move the inner workings of handle_percpu_irq() out into
> >>       a static function which returns the 'handled' value and
> >>       share it between the two handler functions.
> >
> > The "inner workings" aren't quite the same in the two cases.
> > handle_percpu_irq() uses handle_irq_event_percpu() while
> > handle_percpu_demux_irq() uses __handle_irq_event_percpu().
> > The latter doesn't do add_interrupt_randomness() because the
> > demultiplexed IRQ handler will do it.  Doing add_interrupt_randomness()
> > twice doesn't break anything, but it's more overhead in the hard irq
> > path, which I'm trying to avoid.  The extra functionality in the
> > non-double-underscore version could be hoisted up to
> > handle_percpu_irq(), but that offsets gains from sharing the
> > inner workings.
> 
> That's not rocket science to solve:
> 
> static irqreturn_t helper(desc, func)
> {
> 	boiler_plate..
>         ret = func(desc)
> 	boiler_plate..
>         return ret;
> }
> 
> No?
> 
> TBH, I still hate that conditional accounting :)
> 
> >>    2) Allocate a proper interrupt for the management mode and invoke it
> >>       via generic_handle_irq() just as any other demultiplex interrupt.
> >>       That spares all the special casing in the core code and just
> >>       works.
> >
> > Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
> > and the interrupt counting is done in Hyper-V specific code that could be
> > removed.  The demux'ed interrupt does the counting.
> >
> > But on arm64 the top-level interrupt *is* a Linux IRQ, so each
> > interrupt will get double-counted, which is a problem.
> 
> What is the problem?
> 
> You have: toplevel, mgmt, device[], right?
> 
> They are all accounted for seperately and each toplevel interrupt might
> result in demultiplexing one or more interrupts (mgmt, device[]), no?
> 
> IMO accounting the toplevel interrupt seperately is informative because
> it allows you to figure out whether demultiplexing is clustered or not,
> but I lost that argument long ago. That's why most demultiplex muck
> installs a chained handler, which is a design fail on it's own.

In /proc/interrupts, the double-counting isn't a problem, and is
potentially helpful as you say. But /proc/stat, for example, shows a total
interrupt count, which will be roughly double what it was before. That
/proc/stat value then shows up in user space in vmstat, for example.
That's what I was concerned about, though it's not a huge problem in
the grand scheme of things.

Michael
Thomas Gleixner June 5, 2024, 2:19 p.m. UTC | #5
On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
>
> In /proc/interrupts, the double-counting isn't a problem, and is
> potentially helpful as you say. But /proc/stat, for example, shows a total
> interrupt count, which will be roughly double what it was before. That
> /proc/stat value then shows up in user space in vmstat, for example.
> That's what I was concerned about, though it's not a huge problem in
> the grand scheme of things.

That's trivial to solve. We can mark interrupts to be excluded from
/proc/stat accounting.

Thanks,

        tglx
Michael Kelley June 6, 2024, 3:14 a.m. UTC | #6
From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
> 
> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> >
> > In /proc/interrupts, the double-counting isn't a problem, and is
> > potentially helpful as you say. But /proc/stat, for example, shows a total
> > interrupt count, which will be roughly double what it was before. That
> > /proc/stat value then shows up in user space in vmstat, for example.
> > That's what I was concerned about, though it's not a huge problem in
> > the grand scheme of things.
> 
> That's trivial to solve. We can mark interrupts to be excluded from
> /proc/stat accounting.
> 

OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
out the HYP interrupts. But what do you envision on arm64, where
there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
with no breakout by IRQ. Identifying the right IRQ and subtracting it
out later looks a lot uglier than the conditional stats accounting.

Or if there's some other approach I'm missing, please enlighten me!

Michael
Thomas Gleixner June 6, 2024, 9:34 a.m. UTC | #7
On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
>> 
>> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
>> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
>> >
>> > In /proc/interrupts, the double-counting isn't a problem, and is
>> > potentially helpful as you say. But /proc/stat, for example, shows a total
>> > interrupt count, which will be roughly double what it was before. That
>> > /proc/stat value then shows up in user space in vmstat, for example.
>> > That's what I was concerned about, though it's not a huge problem in
>> > the grand scheme of things.
>> 
>> That's trivial to solve. We can mark interrupts to be excluded from
>> /proc/stat accounting.
>> 
>
> OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> out the HYP interrupts. But what do you envision on arm64, where
> there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
> normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> with no breakout by IRQ. Identifying the right IRQ and subtracting it
> out later looks a lot uglier than the conditional stats accounting.

Sure. There are two ways to solve that:

1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt
   accordingly and make the stats increment conditional on it.
   The downside is that the conditional affects every interrupt.

2) Do something like this:

static inline
void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc *))
{
	struct irq_chip *chip = irq_desc_get_chip(desc);

	if (chip->irq_ack)
		chip->irq_ack(&desc->irq_data);

	handle(desc);

	if (chip->irq_eoi)
		chip->irq_eoi(&desc->irq_data);
}

void handle_percpu_irq(struct irq_desc *desc)
{
	/*
	 * PER CPU interrupts are not serialized. Do not touch
	 * desc->tot_count.
	 */
	__kstat_incr_irqs_this_cpu(desc);
	__handle_percpu_irq(desc, handle_irq_event_percpu);
}

void handle_percpu_irq_nostat(struct irq_desc *desc)
{
	__this_cpu_inc(desc->kstat_irqs->cnt);
	__handle_percpu_irq(desc, __handle_irq_event_percpu);
}
 
So that keeps the interrupt accounted for in /proc/interrupts. If you
don't want that remove the __this_cpu_inc() and mark the interrupt with
irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
/proc/interrupts too.

Thanks,

        tglx
Michael Kelley June 6, 2024, 2:34 p.m. UTC | #8
From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, June 6, 2024 2:34 AM
> 
> On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
> >>
> >> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> >> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> >> >
> >> > In /proc/interrupts, the double-counting isn't a problem, and is
> >> > potentially helpful as you say. But /proc/stat, for example, shows a total
> >> > interrupt count, which will be roughly double what it was before. That
> >> > /proc/stat value then shows up in user space in vmstat, for example.
> >> > That's what I was concerned about, though it's not a huge problem in
> >> > the grand scheme of things.
> >>
> >> That's trivial to solve. We can mark interrupts to be excluded from
> >> /proc/stat accounting.
> >>
> >
> > OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> > out the HYP interrupts. But what do you envision on arm64, where
> > there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
> > normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> > with no breakout by IRQ. Identifying the right IRQ and subtracting it
> > out later looks a lot uglier than the conditional stats accounting.
> 
> Sure. There are two ways to solve that:
> 
> 1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt
>    accordingly and make the stats increment conditional on it.
>    The downside is that the conditional affects every interrupt.
> 
> 2) Do something like this:
> 
> static inline
> void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc
> *))
> {
> 	struct irq_chip *chip = irq_desc_get_chip(desc);
> 
> 	if (chip->irq_ack)
> 		chip->irq_ack(&desc->irq_data);
> 
> 	handle(desc);
> 
> 	if (chip->irq_eoi)
> 		chip->irq_eoi(&desc->irq_data);
> }
> 
> void handle_percpu_irq(struct irq_desc *desc)
> {
> 	/*
> 	 * PER CPU interrupts are not serialized. Do not touch
> 	 * desc->tot_count.
> 	 */
> 	__kstat_incr_irqs_this_cpu(desc);
> 	__handle_percpu_irq(desc, handle_irq_event_percpu);
> }
> 
> void handle_percpu_irq_nostat(struct irq_desc *desc)
> {
> 	__this_cpu_inc(desc->kstat_irqs->cnt);
> 	__handle_percpu_irq(desc, __handle_irq_event_percpu);
> }
> 
> So that keeps the interrupt accounted for in /proc/interrupts. If you
> don't want that remove the __this_cpu_inc() and mark the interrupt with
> irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
> /proc/interrupts too.
> 

Yes, this works for not double-counting in the first place. Account for the
control message interrupts in their own Linux IRQ. Then for the top-level
interrupt, instead of adding a new handler with conditional accounting,
add a new per-CPU handler that does no accounting. I had not noticed
the IRQ_HIDDEN flag, and that solves my concern about having an
entry in /proc/interrupts that always shows zero interrupts.  And with
no double-counting, the interrupt counts in /proc/stat won't be bloated.

On x86, I'll have to separately make the "HYP" line in /proc/interrupts
go away, but that's easy.

Thanks,

Michael
diff mbox series

Patch

diff --git a/include/linux/irq.h b/include/linux/irq.h
index a217e1029c1d..8825b95cefe5 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -656,6 +656,7 @@  extern void handle_edge_eoi_irq(struct irq_desc *desc);
 extern void handle_simple_irq(struct irq_desc *desc);
 extern void handle_untracked_irq(struct irq_desc *desc);
 extern void handle_percpu_irq(struct irq_desc *desc);
+extern void handle_percpu_demux_irq(struct irq_desc *desc);
 extern void handle_percpu_devid_irq(struct irq_desc *desc);
 extern void handle_bad_irq(struct irq_desc *desc);
 extern void handle_nested_irq(unsigned int irq);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c94..1f37a9db4a4d 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -910,6 +910,35 @@  void handle_percpu_irq(struct irq_desc *desc)
 		chip->irq_eoi(&desc->irq_data);
 }
 
+/**
+ *	handle_percpu_demux_irq - Per CPU local irq handler for IRQs
+ *	that may demultiplex to other IRQs
+ *	@desc:	the interrupt description structure for this irq
+ *
+ *	For per CPU interrupts on SMP machines without locking requirements.
+ *	Used for IRQs that may demultiplex to other IRQs that will do the
+ *     stats tracking and randomness. If the handler result indicates no
+ *	demultiplexing is done, account for the interrupt on this IRQ.
+ */
+void handle_percpu_demux_irq(struct irq_desc *desc)
+{
+	struct irq_chip *chip = irq_desc_get_chip(desc);
+
+	if (chip->irq_ack)
+		chip->irq_ack(&desc->irq_data);
+
+	if (__handle_irq_event_percpu(desc))
+		/*
+		 * PER CPU interrupts are not serialized.  Do not touch
+		 * desc->tot_count.
+		 */
+		__kstat_incr_irqs_this_cpu(desc);
+
+	if (chip->irq_eoi)
+		chip->irq_eoi(&desc->irq_data);
+}
+EXPORT_SYMBOL_GPL(handle_percpu_demux_irq);
+
 /**
  * handle_percpu_devid_irq - Per CPU local irq handler with per cpu dev ids
  * @desc:	the interrupt description structure for this irq