mbox series

[for-4.19,0/9] x86/irq: fixes for CPU hot{,un}plug

Message ID 20240529090132.59434-1-roger.pau@citrix.com (mailing list archive)
Headers show
Series x86/irq: fixes for CPU hot{,un}plug | expand

Message

Roger Pau Monné May 29, 2024, 9:01 a.m. UTC
Hello,

The following series aim to fix interrupt handling when doing CPU
plug/unplug operations.  Without this series running:

cpus=`xl info max_cpu_id`
while [ 1 ]; do
    for i in `seq 1 $cpus`; do
        xen-hptool cpu-offline $i;
        xen-hptool cpu-online $i;
    done
done

Quite quickly results in interrupts getting lost and "No irq handler for
vector" messages on the Xen console.  Drivers in dom0 also start getting
interrupt timeouts and the system becomes unusable.

After applying the series running the loop over night still result in a
fully usable system, no  "No irq handler for vector" messages at all, no
interrupt loses reported by dom0.  Test with
x2apic-mode={mixed,cluster}.

I'm tagging this for 4.19 as it's IMO bugfixes, but the series has grown
quite bigger than expected, and hence we need to be careful to not
introduce breakages late in the release cycle.  I've attempted to
document all code as good as I could, interrupt handling has some
unexpected corner cases that are hard to diagnose and reason about.

I'm currently also doing some extra testing with XenRT in case I've
missed something.

Thanks, Roger.

Roger Pau Monne (9):
  x86/irq: remove offline CPUs from old CPU mask when adjusting
    move_cleanup_count
  xen/cpu: do not get the CPU map in stop_machine_run()
  xen/cpu: ensure get_cpu_maps() returns false if CPU operations are
    underway
  x86/irq: describe how the interrupt CPU movement works
  x86/irq: limit interrupt movement done by fixup_irqs()
  x86/irq: restrict CPU movement in set_desc_affinity()
  x86/irq: deal with old_cpu_mask for interrupts in movement in
    fixup_irqs()
  x86/irq: handle moving interrupts in _assign_irq_vector()
  x86/irq: forward pending interrupts to new destination in fixup_irqs()

 xen/arch/x86/apic.c             |   5 +
 xen/arch/x86/include/asm/apic.h |   3 +
 xen/arch/x86/include/asm/irq.h  |  28 +++++-
 xen/arch/x86/irq.c              | 172 +++++++++++++++++++++++++-------
 xen/common/cpu.c                |   8 +-
 xen/common/stop_machine.c       |  15 +--
 xen/include/xen/cpu.h           |   2 +
 xen/include/xen/rwlock.h        |   2 +
 8 files changed, 190 insertions(+), 45 deletions(-)

Comments

Oleksii Kurochko May 29, 2024, 9:37 a.m. UTC | #1
On Wed, 2024-05-29 at 11:01 +0200, Roger Pau Monne wrote:
> Hello,
> 
> The following series aim to fix interrupt handling when doing CPU
> plug/unplug operations.  Without this series running:
> 
> cpus=`xl info max_cpu_id`
> while [ 1 ]; do
>     for i in `seq 1 $cpus`; do
>         xen-hptool cpu-offline $i;
>         xen-hptool cpu-online $i;
>     done
> done
> 
> Quite quickly results in interrupts getting lost and "No irq handler
> for
> vector" messages on the Xen console.  Drivers in dom0 also start
> getting
> interrupt timeouts and the system becomes unusable.
> 
> After applying the series running the loop over night still result in
> a
> fully usable system, no  "No irq handler for vector" messages at all,
> no
> interrupt loses reported by dom0.  Test with
> x2apic-mode={mixed,cluster}.
> 
> I'm tagging this for 4.19 as it's IMO bugfixes, but the series has
> grown
> quite bigger than expected, and hence we need to be careful to not
> introduce breakages late in the release cycle.  I've attempted to
> document all code as good as I could, interrupt handling has some
> unexpected corner cases that are hard to diagnose and reason about.
Despite of the fact that it can be considered as bugfixes, it seems to
me that this patch series can be risky. Let's wait for maintainers
opinion...

~ Oleksii
> 
> I'm currently also doing some extra testing with XenRT in case I've
> missed something.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (9):
>   x86/irq: remove offline CPUs from old CPU mask when adjusting
>     move_cleanup_count
>   xen/cpu: do not get the CPU map in stop_machine_run()
>   xen/cpu: ensure get_cpu_maps() returns false if CPU operations are
>     underway
>   x86/irq: describe how the interrupt CPU movement works
>   x86/irq: limit interrupt movement done by fixup_irqs()
>   x86/irq: restrict CPU movement in set_desc_affinity()
>   x86/irq: deal with old_cpu_mask for interrupts in movement in
>     fixup_irqs()
>   x86/irq: handle moving interrupts in _assign_irq_vector()
>   x86/irq: forward pending interrupts to new destination in
> fixup_irqs()
> 
>  xen/arch/x86/apic.c             |   5 +
>  xen/arch/x86/include/asm/apic.h |   3 +
>  xen/arch/x86/include/asm/irq.h  |  28 +++++-
>  xen/arch/x86/irq.c              | 172 +++++++++++++++++++++++++-----
> --
>  xen/common/cpu.c                |   8 +-
>  xen/common/stop_machine.c       |  15 +--
>  xen/include/xen/cpu.h           |   2 +
>  xen/include/xen/rwlock.h        |   2 +
>  8 files changed, 190 insertions(+), 45 deletions(-)
>
Jan Beulich June 11, 2024, 1:25 p.m. UTC | #2
On 29.05.2024 11:37, Oleksii K. wrote:
> On Wed, 2024-05-29 at 11:01 +0200, Roger Pau Monne wrote:
>> Hello,
>>
>> The following series aim to fix interrupt handling when doing CPU
>> plug/unplug operations.  Without this series running:
>>
>> cpus=`xl info max_cpu_id`
>> while [ 1 ]; do
>>     for i in `seq 1 $cpus`; do
>>         xen-hptool cpu-offline $i;
>>         xen-hptool cpu-online $i;
>>     done
>> done
>>
>> Quite quickly results in interrupts getting lost and "No irq handler
>> for
>> vector" messages on the Xen console.  Drivers in dom0 also start
>> getting
>> interrupt timeouts and the system becomes unusable.
>>
>> After applying the series running the loop over night still result in
>> a
>> fully usable system, no  "No irq handler for vector" messages at all,
>> no
>> interrupt loses reported by dom0.  Test with
>> x2apic-mode={mixed,cluster}.
>>
>> I'm tagging this for 4.19 as it's IMO bugfixes, but the series has
>> grown
>> quite bigger than expected, and hence we need to be careful to not
>> introduce breakages late in the release cycle.  I've attempted to
>> document all code as good as I could, interrupt handling has some
>> unexpected corner cases that are hard to diagnose and reason about.
> Despite of the fact that it can be considered as bugfixes, it seems to
> me that this patch series can be risky. Let's wait for maintainers
> opinion...

Working my way through v2 of this series, I think I'd be okay with
including stuff there up to patch 5. Patch 6, which I just finished
taking a first look at, is likely correct (and it's just me missing some
aspects to fully grok the changes done there), but at the same time
looks to be more intrusive than we would like to have it at this point
of the release cycle. That said, I'd be pretty okay to be overridden in
this regard by Roger and/or Andrew.

Jan