[2/9] x86/IRQ: deal with move cleanup count state in fixup_irqs()

Message ID: 5CC6DEA80200007800229E9D@prv1-mh.provo.novell.com (mailing list archive)
State: Superseded
Series: x86: IRQ management adjustments

Commit Message

Jan Beulich April 29, 2019, 11:23 a.m. UTC
The cleanup IPI may get sent immediately before a CPU gets removed from
the online map. In such a case the IPI would get handled on the CPU
being offlined no earlier than in the interrupts disabled window after
fixup_irqs()' main loop. This is too late, however, because a possible
affinity change may incur the need for vector assignment, which will
fail when the IRQ's move cleanup count is still non-zero.

To fix this
- record the set of CPUs the cleanup IPI actually gets sent to alongside
  setting their count,
- adjust the count in fixup_irqs(), accounting for all CPUs that the
  cleanup IPI was sent to, but that are no longer online,
- bail early from the cleanup IPI handler when the CPU is no longer
  online, to prevent double accounting.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: The proper recording of the IPI destinations actually makes the
     move_cleanup_count field redundant. Do we want to drop it, at the
     price of a few more CPU-mask operations?
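
     A minimal sketch (hypothetical helper, not part of this patch) of
     what re-deriving the count on demand could look like:

     /* Hypothetical sketch: re-derive the pending-cleanup CPU count from
      * old_cpu_mask instead of caching it in move_cleanup_count. */
     static unsigned int pending_cleanup_cpus(const struct irq_desc *desc)
     {
         cpumask_t pending;

         /* CPUs the cleanup IPI was sent to that are still online. */
         cpumask_and(&pending, desc->arch.old_cpu_mask, &cpu_online_map);

         return cpumask_weight(&pending);
     }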

Comments

Roger Pau Monné May 3, 2019, 3:21 p.m. UTC | #1
On Mon, Apr 29, 2019 at 05:23:20AM -0600, Jan Beulich wrote:
> The cleanup IPI may get sent immediately before a CPU gets removed from
> the online map. In such a case the IPI would get handled on the CPU
> being offlined no earlier than in the interrupts disabled window after
> fixup_irqs()' main loop. This is too late, however, because a possible
> affinity change may incur the need for vector assignment, which will
> fail when the IRQ's move cleanup count is still non-zero.
> 
> To fix this
> - record the set of CPUs the cleanup IPI actually gets sent to alongside
>   setting their count,
> - adjust the count in fixup_irqs(), accounting for all CPUs that the
>   cleanup IPI was sent to, but that are no longer online,
> - bail early from the cleanup IPI handler when the CPU is no longer
>   online, to prevent double accounting.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Just as a note, this whole interrupt migration business seems
extremely complex, and I wonder whether Xen really needs it, or what
exactly its performance gain is compared to simpler solutions. I
understand these are just fixes, but IMO they make the logic even
more complex.

Maybe it would be simpler to have the interrupts hard-bound to pCPUs
and instead have a soft-affinity on the guest vCPUs that are assigned
as the destination?

> ---
> TBD: The proper recording of the IPI destinations actually makes the
>      move_cleanup_count field redundant. Do we want to drop it, at the
>      price of a few more CPU-mask operations?

AFAICT this is not a hot path, so I would remove the
move_cleanup_count field and just weight the cpu bitmap when needed.

Thanks, Roger.
Jan Beulich May 6, 2019, 7:44 a.m. UTC | #2
>>> On 03.05.19 at 17:21, <roger.pau@citrix.com> wrote:
> On Mon, Apr 29, 2019 at 05:23:20AM -0600, Jan Beulich wrote:
>> The cleanup IPI may get sent immediately before a CPU gets removed from
>> the online map. In such a case the IPI would get handled on the CPU
>> being offlined no earlier than in the interrupts disabled window after
>> fixup_irqs()' main loop. This is too late, however, because a possible
>> affinity change may incur the need for vector assignment, which will
>> fail when the IRQ's move cleanup count is still non-zero.
>> 
>> To fix this
>> - record the set of CPUs the cleanup IPI actually gets sent to alongside
>>   setting their count,
>> - adjust the count in fixup_irqs(), accounting for all CPUs that the
>>   cleanup IPI was sent to, but that are no longer online,
>> - bail early from the cleanup IPI handler when the CPU is no longer
>>   online, to prevent double accounting.
>> 
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> Just as a note, this whole interrupt migration business seems
> extremely complex, and I wonder whether Xen really needs it, or what
> exactly its performance gain is compared to simpler solutions.

What simpler solutions do you have in mind? Having IRQ affinities
track those of their assigned vCPUs was added largely to avoid
high-rate interrupts always arriving on a CPU other than the one
where the actual handling will take place. Arguably this may go
too far for low-rate interrupts, but adding a heuristic to that
effect would only further complicate the handling.

> I understand these are just fixes, but IMO they make the logic even
> more complex.
> 
> Maybe it would be simpler to have the interrupts hard-bound to pCPUs
> and instead have a soft-affinity on the guest vCPUs that are assigned
> as the destination?

How would the soft affinity be calculated for a vCPU that has
multiple IRQs (with at most partially overlapping affinities) to be
serviced by it?

>> ---
>> TBD: The proper recording of the IPI destinations actually makes the
>>      move_cleanup_count field redundant. Do we want to drop it, at the
>>      price of a few more CPU-mask operations?
> 
> AFAICT this is not a hot path, so I would remove the
> move_cleanup_count field and just weight the cpu bitmap when needed.

Added for v2 (pending successful testing).

Jan
Jan Beulich May 7, 2019, 7:28 a.m. UTC | #3
>>> On 03.05.19 at 17:21, <roger.pau@citrix.com> wrote:
> On Mon, Apr 29, 2019 at 05:23:20AM -0600, Jan Beulich wrote:
>> ---
>> TBD: The proper recording of the IPI destinations actually makes the
>>      move_cleanup_count field redundant. Do we want to drop it, at the
>>      price of a few more CPU-mask operations?
> 
> AFAICT this is not a hot path, so I would remove the
> move_cleanup_count field and just weight the cpu bitmap when needed.

FTR: It's not fully redundant - the patch removing it that I had
added was actually the reason the ASSERT() you asked for in patch 1
triggered. That's because the field serves as a flag telling
irq_move_cleanup_interrupt() whether to act on an IRQ in the first
place. Without it, the function will prematurely clean up the
vector, thus confusing subsequent code trying to do the cleanup when
it's actually due.
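
Roughly, a simplified sketch (not the verbatim code; hypothetical
helper name) of the gating the field provides for a single IRQ inside
the handler's vector loop:

    static void cleanup_one_irq(struct irq_desc *desc)
    {
        spin_lock(&desc->lock);

        /* A zero count means "no cleanup pending" - leave the vector
         * alone. */
        if ( !desc->arch.move_cleanup_count )
            goto unlock;

        /* ... IRR check and other per-CPU cleanup details elided ... */

        if ( !--desc->arch.move_cleanup_count )
            release_old_vec(desc); /* last outstanding CPU is done */

     unlock:
        spin_unlock(&desc->lock);
    }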

Jan
Roger Pau Monné May 7, 2019, 8:12 a.m. UTC | #4
On Tue, May 07, 2019 at 01:28:36AM -0600, Jan Beulich wrote:
> >>> On 03.05.19 at 17:21, <roger.pau@citrix.com> wrote:
> > On Mon, Apr 29, 2019 at 05:23:20AM -0600, Jan Beulich wrote:
> >> ---
> >> TBD: The proper recording of the IPI destinations actually makes the
> >>      move_cleanup_count field redundant. Do we want to drop it, at the
> >>      price of a few more CPU-mask operations?
> > 
> > AFAICT this is not a hot path, so I would remove the
> > move_cleanup_count field and just weight the cpu bitmap when needed.
> 
> FTR: It's not fully redundant - the patch removing it that I had
> added was actually the reason the ASSERT() you asked for in patch 1
> triggered. That's because the field serves as a flag telling
> irq_move_cleanup_interrupt() whether to act on an IRQ in the first
> place. Without it, the function will prematurely clean up the
> vector, thus confusing subsequent code trying to do the cleanup when
> it's actually due.

So weighing desc->arch.old_cpu_mask is not equivalent to
move_cleanup_count?

Thanks, Roger.
Jan Beulich May 7, 2019, 9:28 a.m. UTC | #5
>>> On 07.05.19 at 10:12, <roger.pau@citrix.com> wrote:
> On Tue, May 07, 2019 at 01:28:36AM -0600, Jan Beulich wrote:
>> >>> On 03.05.19 at 17:21, <roger.pau@citrix.com> wrote:
>> > On Mon, Apr 29, 2019 at 05:23:20AM -0600, Jan Beulich wrote:
>> >> ---
>> >> TBD: The proper recording of the IPI destinations actually makes the
>> >>      move_cleanup_count field redundant. Do we want to drop it, at the
>> >>      price of a few more CPU-mask operations?
>> > 
>> > AFAICT this is not a hot path, so I would remove the
>> > move_cleanup_count field and just weight the cpu bitmap when needed.
>> 
>> FTR: It's not fully redundant - the patch removing it that I had
>> added was actually the reason the ASSERT() you asked for in patch 1
>> triggered. That's because the field serves as a flag telling
>> irq_move_cleanup_interrupt() whether to act on an IRQ in the first
>> place. Without it, the function will prematurely clean up the
>> vector, thus confusing subsequent code trying to do the cleanup when
>> it's actually due.
> 
> So weighing desc->arch.old_cpu_mask is not equivalent to
> move_cleanup_count?

Not exactly, no: while the field gets set from the cpumask_weight()
result, it matters _when_ that happens. Prior to that point, what bits
are set in the mask is of no interest to irq_move_cleanup_interrupt().
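
To illustrate (hypothetical helper, not code from the tree):

    static bool cleanup_pending(const struct irq_desc *desc)
    {
        /* Wrong: old_cpu_mask is already non-empty while a move is
         * merely in progress, i.e. before send_cleanup_vector() ran. */
        /* return !cpumask_empty(desc->arch.old_cpu_mask); */

        /* What the field actually encodes: the cleanup IPI has been
         * sent and this many CPUs still have to act on it. */
        return desc->arch.move_cleanup_count != 0;
    }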

Jan

Patch

--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -658,6 +658,9 @@  void irq_move_cleanup_interrupt(struct c
     ack_APIC_irq();
 
     me = smp_processor_id();
+    if ( !cpu_online(me) )
+        return;
+
     for ( vector = FIRST_DYNAMIC_VECTOR;
           vector <= LAST_HIPRIORITY_VECTOR; vector++)
     {
@@ -717,11 +720,14 @@  unlock:
 
 static void send_cleanup_vector(struct irq_desc *desc)
 {
-    cpumask_t cleanup_mask;
+    cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
+                &cpu_online_map);
+    desc->arch.move_cleanup_count = cpumask_weight(desc->arch.old_cpu_mask);
 
-    cpumask_and(&cleanup_mask, desc->arch.old_cpu_mask, &cpu_online_map);
-    desc->arch.move_cleanup_count = cpumask_weight(&cleanup_mask);
-    send_IPI_mask(&cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
+    if ( desc->arch.move_cleanup_count )
+        send_IPI_mask(desc->arch.old_cpu_mask, IRQ_MOVE_CLEANUP_VECTOR);
+    else
+        release_old_vec(desc);
 
     desc->arch.move_in_progress = 0;
 }
@@ -2394,6 +2400,16 @@  void fixup_irqs(const cpumask_t *mask, b
              vector <= LAST_HIPRIORITY_VECTOR )
             cpumask_and(desc->arch.cpu_mask, desc->arch.cpu_mask, mask);
 
+        if ( desc->arch.move_cleanup_count )
+        {
+            /* The cleanup IPI may have got sent while we were still online. */
+            cpumask_andnot(&affinity, desc->arch.old_cpu_mask,
+                           &cpu_online_map);
+            desc->arch.move_cleanup_count -= cpumask_weight(&affinity);
+            if ( !desc->arch.move_cleanup_count )
+                release_old_vec(desc);
+        }
+
         cpumask_copy(&affinity, desc->affinity);
         if ( !desc->action || cpumask_subset(&affinity, mask) )
         {