Message ID | 20240604050940.859909-7-mhklinux@outlook.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | Hyper-V guests use Linux IRQs for channel interrupts | expand |
Michael! On Mon, Jun 03 2024 at 22:09, mhkelley58@gmail.com wrote: > Hyper-V VMBus devices generate interrupts that are multiplexed > onto a single per-CPU architectural interrupt. The top-level VMBus > driver ISR demultiplexes these interrupts and invokes per-device > handlers. Currently, these per-device handlers are not modeled as > Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted > to the top level architectural interrupt. Visibility into per-device > interrupt stats requires accessing VMBus-specific entries in sysfs. > The top-level VMBus driver ISR also handles management-related > interrupts that are not attributable to a particular VMBus device. > > As part of changing VMBus to model VMBus per-device handlers as > normal Linux IRQs, the top-level VMBus driver needs to conditionally > account for interrupts. If it passes the interrupt off to a > device-specific IRQ, the interrupt stats are done by that IRQ > handler, and accounting for the interrupt at the top level > is duplicative. But if it handles a management-related interrupt > itself, then it should account for the interrupt itself. > > Introduce a new flow handler that provides this functionality. > The new handler parallels handle_percpu_irq(), but does stats > only if the ISR returns other than IRQ_NONE. The existing > handle_untracked_irq() can't be used because it doesn't work for > per-cpu IRQs, and it doesn't provide conditional stats. There is a two other options to solve this: 1) Move the inner workings of handle_percpu_irq() out into a static function which returns the 'handled' value and share it between the two handler functions. 2) Allocate a proper interrupt for the management mode and invoke it via generic_handle_irq() just as any other demultiplex interrupt. That spares all the special casing in the core code and just works. Thanks, tglx
From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM > > Michael! > > On Mon, Jun 03 2024 at 22:09, mhkelley58@gmail.com wrote: > > Hyper-V VMBus devices generate interrupts that are multiplexed > > onto a single per-CPU architectural interrupt. The top-level VMBus > > driver ISR demultiplexes these interrupts and invokes per-device > > handlers. Currently, these per-device handlers are not modeled as > > Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted > > to the top level architectural interrupt. Visibility into per-device > > interrupt stats requires accessing VMBus-specific entries in sysfs. > > The top-level VMBus driver ISR also handles management-related > > interrupts that are not attributable to a particular VMBus device. > > > > As part of changing VMBus to model VMBus per-device handlers as > > normal Linux IRQs, the top-level VMBus driver needs to conditionally > > account for interrupts. If it passes the interrupt off to a > > device-specific IRQ, the interrupt stats are done by that IRQ > > handler, and accounting for the interrupt at the top level > > is duplicative. But if it handles a management-related interrupt > > itself, then it should account for the interrupt itself. > > > > Introduce a new flow handler that provides this functionality. > > The new handler parallels handle_percpu_irq(), but does stats > > only if the ISR returns other than IRQ_NONE. The existing > > handle_untracked_irq() can't be used because it doesn't work for > > per-cpu IRQs, and it doesn't provide conditional stats. > > There is a two other options to solve this: > Thanks for taking a look. Unfortunately, unless I'm missing something, both options you suggest have downsides. > 1) Move the inner workings of handle_percpu_irq() out into > a static function which returns the 'handled' value and > share it between the two handler functions. The "inner workings" aren't quite the same in the two cases. handle_percpu_irq() uses handle_irq_event_percpu() while handle_percpu_demux_irq() uses __handle_irq_event_percpu(). The latter doesn't do add_interrupt_randomness() because the demultiplexed IRQ handler will do it. Doing add_interrupt_randomness() twice doesn't break anything, but it's more overhead in the hard irq path, which I'm trying to avoid. The extra functionality in the non-double-underscore version could be hoisted up to handle_percpu_irq(), but that offsets gains from sharing the inner workings. > > 2) Allocate a proper interrupt for the management mode and invoke it > via generic_handle_irq() just as any other demultiplex interrupt. > That spares all the special casing in the core code and just > works. Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ, and the interrupt counting is done in Hyper-V specific code that could be removed. The demux'ed interrupt does the counting. But on arm64 the top-level interrupt *is* a Linux IRQ, so each interrupt will get double-counted, which is a problem. Having to add handle_percpu_demux_irq() to handle arm64 correctly isn't as clean as I wish it could be. But I couldn't find a better approach. Michael
On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote: > From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM >> 1) Move the inner workings of handle_percpu_irq() out into >> a static function which returns the 'handled' value and >> share it between the two handler functions. > > The "inner workings" aren't quite the same in the two cases. > handle_percpu_irq() uses handle_irq_event_percpu() while > handle_percpu_demux_irq() uses __handle_irq_event_percpu(). > The latter doesn't do add_interrupt_randomness() because the > demultiplexed IRQ handler will do it. Doing add_interrupt_randomness() > twice doesn't break anything, but it's more overhead in the hard irq > path, which I'm trying to avoid. The extra functionality in the > non-double-underscore version could be hoisted up to > handle_percpu_irq(), but that offsets gains from sharing the > inner workings. That's not rocket science to solve: static irqreturn_t helper(desc, func) { boiler_plate.. ret = func(desc) boiler_plate.. return ret; } No? TBH, I still hate that conditional accounting :) >> 2) Allocate a proper interrupt for the management mode and invoke it >> via generic_handle_irq() just as any other demultiplex interrupt. >> That spares all the special casing in the core code and just >> works. > > Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ, > and the interrupt counting is done in Hyper-V specific code that could be > removed. The demux'ed interrupt does the counting. > > But on arm64 the top-level interrupt *is* a Linux IRQ, so each > interrupt will get double-counted, which is a problem. What is the problem? You have: toplevel, mgmt, device[], right? They are all accounted for seperately and each toplevel interrupt might result in demultiplexing one or more interrupts (mgmt, device[]), no? IMO accounting the toplevel interrupt seperately is informative because it allows you to figure out whether demultiplexing is clustered or not, but I lost that argument long ago. That's why most demultiplex muck installs a chained handler, which is a design fail on it's own. Thanks, tglx
From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM > > On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote: > > From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM > >> 1) Move the inner workings of handle_percpu_irq() out into > >> a static function which returns the 'handled' value and > >> share it between the two handler functions. > > > > The "inner workings" aren't quite the same in the two cases. > > handle_percpu_irq() uses handle_irq_event_percpu() while > > handle_percpu_demux_irq() uses __handle_irq_event_percpu(). > > The latter doesn't do add_interrupt_randomness() because the > > demultiplexed IRQ handler will do it. Doing add_interrupt_randomness() > > twice doesn't break anything, but it's more overhead in the hard irq > > path, which I'm trying to avoid. The extra functionality in the > > non-double-underscore version could be hoisted up to > > handle_percpu_irq(), but that offsets gains from sharing the > > inner workings. > > That's not rocket science to solve: > > static irqreturn_t helper(desc, func) > { > boiler_plate.. > ret = func(desc) > boiler_plate.. > return ret; > } > > No? > > TBH, I still hate that conditional accounting :) > > >> 2) Allocate a proper interrupt for the management mode and invoke it > >> via generic_handle_irq() just as any other demultiplex interrupt. > >> That spares all the special casing in the core code and just > >> works. > > > > Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ, > > and the interrupt counting is done in Hyper-V specific code that could be > > removed. The demux'ed interrupt does the counting. > > > > But on arm64 the top-level interrupt *is* a Linux IRQ, so each > > interrupt will get double-counted, which is a problem. > > What is the problem? > > You have: toplevel, mgmt, device[], right? > > They are all accounted for seperately and each toplevel interrupt might > result in demultiplexing one or more interrupts (mgmt, device[]), no? > > IMO accounting the toplevel interrupt seperately is informative because > it allows you to figure out whether demultiplexing is clustered or not, > but I lost that argument long ago. That's why most demultiplex muck > installs a chained handler, which is a design fail on it's own. In /proc/interrupts, the double-counting isn't a problem, and is potentially helpful as you say. But /proc/stat, for example, shows a total interrupt count, which will be roughly double what it was before. That /proc/stat value then shows up in user space in vmstat, for example. That's what I was concerned about, though it's not a huge problem in the grand scheme of things. Michael
On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote: > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM > > In /proc/interrupts, the double-counting isn't a problem, and is > potentially helpful as you say. But /proc/stat, for example, shows a total > interrupt count, which will be roughly double what it was before. That > /proc/stat value then shows up in user space in vmstat, for example. > That's what I was concerned about, though it's not a huge problem in > the grand scheme of things. That's trivial to solve. We can mark interrupts to be excluded from /proc/stat accounting. Thanks, tglx
From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM > > On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote: > > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM > > > > In /proc/interrupts, the double-counting isn't a problem, and is > > potentially helpful as you say. But /proc/stat, for example, shows a total > > interrupt count, which will be roughly double what it was before. That > > /proc/stat value then shows up in user space in vmstat, for example. > > That's what I was concerned about, though it's not a huge problem in > > the grand scheme of things. > > That's trivial to solve. We can mark interrupts to be excluded from > /proc/stat accounting. > OK. On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter out the HYP interrupts. But what do you envision on arm64, where there is no arch_irq_stat_cpu()? On arm64, the top-level interrupt is a normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field with no breakout by IRQ. Identifying the right IRQ and subtracting it out later looks a lot uglier than the conditional stats accounting. Or if there's some other approach I'm missing, please enlighten me! Michael
On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote: > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM >> >> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote: >> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM >> > >> > In /proc/interrupts, the double-counting isn't a problem, and is >> > potentially helpful as you say. But /proc/stat, for example, shows a total >> > interrupt count, which will be roughly double what it was before. That >> > /proc/stat value then shows up in user space in vmstat, for example. >> > That's what I was concerned about, though it's not a huge problem in >> > the grand scheme of things. >> >> That's trivial to solve. We can mark interrupts to be excluded from >> /proc/stat accounting. >> > > OK. On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter > out the HYP interrupts. But what do you envision on arm64, where > there is no arch_irq_stat_cpu()? On arm64, the top-level interrupt is a > normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field > with no breakout by IRQ. Identifying the right IRQ and subtracting it > out later looks a lot uglier than the conditional stats accounting. Sure. There are two ways to solve that: 1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt accordingly and make the stats increment conditional on it. The downside is that the conditional affects every interrupt. 2) Do something like this: static inline void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc *)) { struct irq_chip *chip = irq_desc_get_chip(desc); if (chip->irq_ack) chip->irq_ack(&desc->irq_data); handle(desc); if (chip->irq_eoi) chip->irq_eoi(&desc->irq_data); } void handle_percpu_irq(struct irq_desc *desc) { /* * PER CPU interrupts are not serialized. Do not touch * desc->tot_count. */ __kstat_incr_irqs_this_cpu(desc); __handle_percpu_irq(desc, handle_irq_event_percpu); } void handle_percpu_irq_nostat(struct irq_desc *desc) { __this_cpu_inc(desc->kstat_irqs->cnt); __handle_percpu_irq(desc, __handle_irq_event_percpu); } So that keeps the interrupt accounted for in /proc/interrupts. If you don't want that remove the __this_cpu_inc() and mark the interrupt with irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from /proc/interrupts too. Thanks, tglx
From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, June 6, 2024 2:34 AM > > On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote: > > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM > >> > >> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote: > >> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM > >> > > >> > In /proc/interrupts, the double-counting isn't a problem, and is > >> > potentially helpful as you say. But /proc/stat, for example, shows a total > >> > interrupt count, which will be roughly double what it was before. That > >> > /proc/stat value then shows up in user space in vmstat, for example. > >> > That's what I was concerned about, though it's not a huge problem in > >> > the grand scheme of things. > >> > >> That's trivial to solve. We can mark interrupts to be excluded from > >> /proc/stat accounting. > >> > > > > OK. On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter > > out the HYP interrupts. But what do you envision on arm64, where > > there is no arch_irq_stat_cpu()? On arm64, the top-level interrupt is a > > normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field > > with no breakout by IRQ. Identifying the right IRQ and subtracting it > > out later looks a lot uglier than the conditional stats accounting. > > Sure. There are two ways to solve that: > > 1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt > accordingly and make the stats increment conditional on it. > The downside is that the conditional affects every interrupt. > > 2) Do something like this: > > static inline > void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc > *)) > { > struct irq_chip *chip = irq_desc_get_chip(desc); > > if (chip->irq_ack) > chip->irq_ack(&desc->irq_data); > > handle(desc); > > if (chip->irq_eoi) > chip->irq_eoi(&desc->irq_data); > } > > void handle_percpu_irq(struct irq_desc *desc) > { > /* > * PER CPU interrupts are not serialized. Do not touch > * desc->tot_count. > */ > __kstat_incr_irqs_this_cpu(desc); > __handle_percpu_irq(desc, handle_irq_event_percpu); > } > > void handle_percpu_irq_nostat(struct irq_desc *desc) > { > __this_cpu_inc(desc->kstat_irqs->cnt); > __handle_percpu_irq(desc, __handle_irq_event_percpu); > } > > So that keeps the interrupt accounted for in /proc/interrupts. If you > don't want that remove the __this_cpu_inc() and mark the interrupt with > irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from > /proc/interrupts too. > Yes, this works for not double-counting in the first place. Account for the control message interrupts in their own Linux IRQ. Then for the top-level interrupt, instead of adding a new handler with conditional accounting, add a new per-CPU handler that does no accounting. I had not noticed the IRQ_HIDDEN flag, and that solves my concern about having an entry in /proc/interrupts that always shows zero interrupts. And with no double-counting, the interrupt counts in /proc/stat won't be bloated. On x86, I'll have to separately make the "HYP" line in /proc/interrupts go away, but that's easy. Thanks, Michael
diff --git a/include/linux/irq.h b/include/linux/irq.h index a217e1029c1d..8825b95cefe5 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -656,6 +656,7 @@ extern void handle_edge_eoi_irq(struct irq_desc *desc); extern void handle_simple_irq(struct irq_desc *desc); extern void handle_untracked_irq(struct irq_desc *desc); extern void handle_percpu_irq(struct irq_desc *desc); +extern void handle_percpu_demux_irq(struct irq_desc *desc); extern void handle_percpu_devid_irq(struct irq_desc *desc); extern void handle_bad_irq(struct irq_desc *desc); extern void handle_nested_irq(unsigned int irq); diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index dc94e0bf2c94..1f37a9db4a4d 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -910,6 +910,35 @@ void handle_percpu_irq(struct irq_desc *desc) chip->irq_eoi(&desc->irq_data); } +/** + * handle_percpu_demux_irq - Per CPU local irq handler for IRQs + * that may demultiplex to other IRQs + * @desc: the interrupt description structure for this irq + * + * For per CPU interrupts on SMP machines without locking requirements. + * Used for IRQs that may demultiplex to other IRQs that will do the + * stats tracking and randomness. If the handler result indicates no + * demultiplexing is done, account for the interrupt on this IRQ. + */ +void handle_percpu_demux_irq(struct irq_desc *desc) +{ + struct irq_chip *chip = irq_desc_get_chip(desc); + + if (chip->irq_ack) + chip->irq_ack(&desc->irq_data); + + if (__handle_irq_event_percpu(desc)) + /* + * PER CPU interrupts are not serialized. Do not touch + * desc->tot_count. + */ + __kstat_incr_irqs_this_cpu(desc); + + if (chip->irq_eoi) + chip->irq_eoi(&desc->irq_data); +} +EXPORT_SYMBOL_GPL(handle_percpu_demux_irq); + /** * handle_percpu_devid_irq - Per CPU local irq handler with per cpu dev ids * @desc: the interrupt description structure for this irq