[v3,0/3] x86/irq: fixes for CPU hot{,un}plug

Message ID 20240613165617.42538-1-roger.pau@citrix.com (mailing list archive)

Message

Roger Pau Monné June 13, 2024, 4:56 p.m. UTC
Hello,

The following series aims to fix interrupt handling when doing CPU
plug/unplug operations.  Without this series, running:

# Highest CPU ID known to Xen.
cpus=$(xl info max_cpu_id)
# Offline and re-online every CPU except CPU 0, forever.
while true; do
    for i in $(seq 1 "$cpus"); do
        xen-hptool cpu-offline "$i"
        xen-hptool cpu-online "$i"
    done
done

Quite quickly results in interrupts getting lost and "No irq handler for
vector" messages on the Xen console.  Drivers in dom0 also start getting
interrupt timeouts and the system becomes unusable.

After applying the series, running the loop overnight still results in a
fully usable system: no "No irq handler for vector" messages at all and no
interrupt losses reported by dom0.  Tested with x2apic-mode={mixed,cluster}.
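
As a rough sketch (not part of the series), the console check mentioned
above could be scripted along these lines.  The helper name
count_missing_handler is hypothetical; it only counts matching lines in
whatever text it is fed on stdin, and it assumes the messages of interest
are still present in the console output being examined.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the patch series):
# count lines containing the "No irq handler for vector" message in the
# console text supplied on stdin.
count_missing_handler() {
    # grep -c prints the number of matching lines.  It exits non-zero when
    # that number is 0, so "|| true" keeps the helper usable under "set -e".
    grep -cF 'No irq handler for vector' || true
}
```

It would typically be fed the hypervisor console log, for example
"xl dmesg | count_missing_handler", with a result of 0 being the expected
outcome once the series is applied.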

I've attempted to document all the code as well as I could, since interrupt
handling has some unexpected corner cases that are hard to diagnose and
reason about.

Some XenRT testing is under way to ensure there are no breakages.

Thanks, Roger.

Roger Pau Monne (3):
  x86/irq: deal with old_cpu_mask for interrupts in movement in
    fixup_irqs()
  x86/irq: handle moving interrupts in _assign_irq_vector()
  x86/irq: forward pending interrupts to new destination in fixup_irqs()

 xen/arch/x86/include/asm/apic.h |   5 +
 xen/arch/x86/irq.c              | 163 +++++++++++++++++++++++++-------
 2 files changed, 132 insertions(+), 36 deletions(-)

Comments

Roger Pau Monné June 14, 2024, 7:28 a.m. UTC | #1
Sorry, forgot to add the for-4.19 tag and Cc Oleksii.

Since we have taken the start of the series, we might as well take the
remaining patches (if other x86 maintainers agree) and attempt to
hopefully fix all the interrupt issues with CPU hotplug/unplug.

FTR: there are further issues when doing CPU hotplug/unplug from a PVH
dom0, but those are out of the scope for 4.19, as I haven't even
started to diagnose what's going on.

Thanks, Roger.

Oleksii Kurochko June 14, 2024, 11:52 a.m. UTC | #2
On Fri, 2024-06-14 at 09:28 +0200, Roger Pau Monné wrote:
> Sorry, forgot to add the for-4.19 tag and Cc Oleksii.
> 
> Since we have taken the start of the series, we might as well take the
> remaining patches (if other x86 maintainers agree) and attempt to
> hopefully fix all the interrupt issues with CPU hotplug/unplug.
> 
> FTR: there are further issues when doing CPU hotplug/unplug from a PVH
> dom0, but those are out of the scope for 4.19, as I haven't even
> started to diagnose what's going on.
And these issues existed before the current patch series was introduced?

~ Oleksii
Roger Pau Monné June 14, 2024, 12:33 p.m. UTC | #3
On Fri, Jun 14, 2024 at 01:52:59PM +0200, Oleksii K. wrote:
> On Fri, 2024-06-14 at 09:28 +0200, Roger Pau Monné wrote:
> > Sorry, forgot to add the for-4.19 tag and Cc Oleksii.
> > 
> > Since we have taken the start of the series, we might as well take the
> > remaining patches (if other x86 maintainers agree) and attempt to
> > hopefully fix all the interrupt issues with CPU hotplug/unplug.
> > 
> > FTR: there are further issues when doing CPU hotplug/unplug from a PVH
> > dom0, but those are out of the scope for 4.19, as I haven't even
> > started to diagnose what's going on.
> And these issues existed before the current patch series was introduced?

Yes, the issues with PVH dom0 CPU hotplug/unplug predate this series and
are in addition to the ones fixed here.

Thanks, Roger.