Message ID | 20230410081438.1750-2-xin3.li@intel.com (mailing list archive)
State      | New, archived
Series     | x86: enable FRED for x86-64
On Mon, Apr 10, 2023 at 01:14:06AM -0700, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
>
> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
> into common_interrupt() just before the spurious interrupt handling.

I'm missing here the "why": I can go forward into the set and see that
you're splitting the handling based on vector types and there's event
classification and the lowest prio vector is not going to be hardwired
to 0x20, yadda yadda...

but some of that should be in the text here so that it is clear why it
is being done.

Thx.
> > IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
> > is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
> > into common_interrupt() just before the spurious interrupt handling.
>
> I'm missing here the "why":
>
> I can go forward into the set and see that you're splitting the handling
> based on vector types and there's event classification and the lowest
> prio vector is not going to be hardwired to 0x20, yadda yadda...
>
> but some of that should be in the text here so that it is clear why it
> is being done.

Per Dave's ask, I am adding details about the benefits that FRED
introduces, and then why we make these changes, which should make it
clearer.

Thanks!
    Xin
On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
> into common_interrupt() just before the spurious interrupt handling.

This is a complete NOOP on non-FRED-enabled systems as the IDT entry is
still separate. So this change makes no sense outside of the FRED
universe. Can we pretty please make this consistent?

Aside of that, the change comes with zero justification. I can see why
this is done, i.e. to spare range checking in the FRED exception entry
code, but that brings up an interesting question:

IRQ_MOVE_CLEANUP_VECTOR is at vector 0x20 on purpose. 0x20 is the
lowest-priority vector, so that the following (mostly theoretical)
situation gets resolved:

    sysvec_irq_move_cleanup()
        if (is_pending_in_apic_IRR(vector_to_clean_up))
            apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);

I.e. when for whatever reason the to-be-cleaned-up vector is still
pending in the local APIC IRR, the function retriggers
IRQ_MOVE_CLEANUP_VECTOR and returns. As the pending to-be-cleaned-up
vector has higher priority, it gets handled _before_ the cleanup vector.
Otherwise this ends up in a live lock.

So the question is whether FRED is changing that priority scheme or not.

> @@ -248,6 +248,10 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
>  	desc = __this_cpu_read(vector_irq[vector]);
>  	if (likely(!IS_ERR_OR_NULL(desc))) {
>  		handle_irq(desc, regs);
> +#ifdef CONFIG_SMP
> +	} else if (vector == IRQ_MOVE_CLEANUP_VECTOR) {
> +		sysvec_irq_move_cleanup(regs);

This nests IDTENTRY:

    common_interrupt()
        irqentry_enter();
        kvm_set_cpu_l1tf_flush_l1d();
        run_irq_on_irqstack_cond(__common_interrupt, ....)
            __common_interrupt()
                sysvec_irq_move_cleanup()
                    irqentry_enter();              <- FAIL
                    kvm_set_cpu_l1tf_flush_l1d();  <- WHY again?
                    run_sysvec_on_irqstack_cond(__sysvec_irq_move_cleanup);
                        __sysvec_irq_move_cleanup();
                    irqentry_exit();

It does not matter whether the cleanup vector is a slow path or not.
Regular interrupts are not nesting, period. Exceptions nest, but
IRQ_MOVE_CLEANUP_VECTOR is not an exception and we don't make an
exception for it.

Stop this mindless hackery and clean it up so it is correct for all
variants.

Thanks,

        tglx
On Sat, Jun 03 2023 at 22:51, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
>> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
>> into common_interrupt() just before the spurious interrupt handling.
>
> This is a complete NOOP on non-FRED-enabled systems as the IDT entry is
> still separate. So this change makes no sense outside of the FRED
> universe. Can we pretty please make this consistent?

The right thing to make this consistent is to get rid of this vector
completely.

There is zero reason for this to be an IPI. This can be delegated to a
worker or whatever delayed mechanism.

Thanks,

        tglx
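The gist of the suggested delayed mechanism, as a minimal untested sketch
(the identifiers cleanup_fn, cleanup_timer and schedule_cleanup_on are
illustrative only; the actual draft patch follows further down the thread):

    /*
     * Untested sketch, not the actual patch: defer vector cleanup to a
     * per-CPU pinned timer instead of reserving an IPI vector for it.
     */
    static void cleanup_fn(struct timer_list *tmr)
    {
            /* Walk the per-CPU cleanup list and free moved vectors. */
    }

    static DEFINE_PER_CPU(struct timer_list, cleanup_timer) =
            __TIMER_INITIALIZER(cleanup_fn, TIMER_PINNED);

    static void schedule_cleanup_on(unsigned int cpu)
    {
            struct timer_list *tmr = per_cpu_ptr(&cleanup_timer, cpu);

            /* Arm the timer on the target CPU; no vector is consumed. */
            if (!timer_pending(tmr)) {
                    tmr->expires = jiffies + 1;
                    add_timer_on(tmr, cpu);
            }
    }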
On 6/5/23 10:07, Thomas Gleixner wrote:
> On Sat, Jun 03 2023 at 22:51, Thomas Gleixner wrote:
>> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>>> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
>>> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
>>> into common_interrupt() just before the spurious interrupt handling.
>>
>> This is a complete NOOP on non-FRED-enabled systems as the IDT entry is
>> still separate. So this change makes no sense outside of the FRED
>> universe. Can we pretty please make this consistent?
>
> The right thing to make this consistent is to get rid of this vector
> completely.
>
> There is zero reason for this to be an IPI. This can be delegated to a
> worker or whatever delayed mechanism.

As we discussed offline, I agree that this is a better solution (and
should be a separate changeset before the FRED one.)

	-hpa
On Mon, Jun 05 2023 at 10:09, H. Peter Anvin wrote:
> On 6/5/23 10:07, Thomas Gleixner wrote:
>> There is zero reason for this to be an IPI. This can be delegated to a
>> worker or whatever delayed mechanism.
>>
> As we discussed offline, I agree that this is a better solution (and
> should be a separate changeset before the FRED one.)

The untested below should do the trick. Wants to be split in several
patches, but you get the idea.

Thanks,

        tglx

---
Subject: x86/vector: Get rid of IRQ_MOVE_CLEANUP_VECTOR
From: Thomas Gleixner <tglx@linutronix.de>

No point to waste a vector for cleaning up the leftovers of a moved
interrupt. Aside of that this must be the lowest priority of all vectors
which makes FRED systems utilizing vectors 0x10-0x1f more complicated
than necessary.

Schedule a timer instead.

Not-Yet-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/hw_irq.h       |    4 -
 arch/x86/include/asm/idtentry.h     |    1
 arch/x86/include/asm/irq_vectors.h  |    7 ---
 arch/x86/kernel/apic/vector.c       |   83 ++++++++++++++++++++++++++----------
 arch/x86/kernel/idt.c               |    1
 arch/x86/platform/uv/uv_irq.c       |    2
 drivers/iommu/amd/iommu.c           |    2
 drivers/iommu/hyperv-iommu.c        |    4 -
 drivers/iommu/intel/irq_remapping.c |    2
 9 files changed, 68 insertions(+), 38 deletions(-)

--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -97,10 +97,10 @@ extern struct irq_cfg *irqd_cfg(struct i
 extern void lock_vector_lock(void);
 extern void unlock_vector_lock(void);
 #ifdef CONFIG_SMP
-extern void send_cleanup_vector(struct irq_cfg *);
+extern void vector_schedule_cleanup(struct irq_cfg *);
 extern void irq_complete_move(struct irq_cfg *cfg);
 #else
-static inline void send_cleanup_vector(struct irq_cfg *c) { }
+static inline void vector_schedule_cleanup(struct irq_cfg *c) { }
 static inline void irq_complete_move(struct irq_cfg *c) { }
 #endif

--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -648,7 +648,6 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI

 #ifdef CONFIG_SMP
 DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
-DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,	sysvec_irq_move_cleanup);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -35,13 +35,6 @@
  */
 #define FIRST_EXTERNAL_VECTOR		0x20

-/*
- * Reserve the lowest usable vector (and hence lowest priority) 0x20 for
- * triggering cleanup after irq migration. 0x21-0x2f will still be used
- * for device interrupts.
- */
-#define IRQ_MOVE_CLEANUP_VECTOR		FIRST_EXTERNAL_VECTOR
-
 #define IA32_SYSCALL_VECTOR		0x80

 /*
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -44,7 +44,18 @@ static cpumask_var_t vector_searchmask;
 static struct irq_chip lapic_controller;
 static struct irq_matrix *vector_matrix;
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
+
+static void vector_cleanup_callback(struct timer_list *tmr);
+
+struct vector_cleanup {
+	struct hlist_head	head;
+	struct timer_list	timer;
+};
+
+static DEFINE_PER_CPU(struct vector_cleanup, vector_cleanup) = {
+	.head	= HLIST_HEAD_INIT,
+	.timer	= __TIMER_INITIALIZER(vector_cleanup_callback, TIMER_PINNED),
+};
 #endif

 void lock_vector_lock(void)
@@ -843,8 +854,12 @@ void lapic_online(void)

 void lapic_offline(void)
 {
+	struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);
+
 	lock_vector_lock();
 	irq_matrix_offline(vector_matrix);
+	WARN_ON_ONCE(try_to_del_timer_sync(&cl->timer) < 0);
+	WARN_ON_ONCE(!hlist_empty(&cl->head));
 	unlock_vector_lock();
 }

@@ -934,62 +949,86 @@ static void free_moved_vector(struct api
 	apicd->move_in_progress = 0;
 }

-DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
+static void vector_cleanup_callback(struct timer_list *tmr)
 {
-	struct hlist_head *clhead = this_cpu_ptr(&cleanup_list);
+	struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
 	struct apic_chip_data *apicd;
 	struct hlist_node *tmp;
+	bool rearm = false;

-	ack_APIC_irq();
 	/* Prevent vectors vanishing under us */
-	raw_spin_lock(&vector_lock);
+	raw_spin_lock_irq(&vector_lock);

-	hlist_for_each_entry_safe(apicd, tmp, clhead, clist) {
+	hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
 		unsigned int irr, vector = apicd->prev_vector;

 		/*
 		 * Paranoia: Check if the vector that needs to be cleaned
-		 * up is registered at the APICs IRR. If so, then this is
-		 * not the best time to clean it up. Clean it up in the
-		 * next attempt by sending another IRQ_MOVE_CLEANUP_VECTOR
-		 * to this CPU. IRQ_MOVE_CLEANUP_VECTOR is the lowest
-		 * priority external vector, so on return from this
-		 * interrupt the device interrupt will happen first.
+		 * up is registered at the APICs IRR. That's clearly a
+		 * hardware issue if the vector arrived on the old target
+		 * _after_ interrupts were disabled above. Keep @apicd
+		 * on the list and schedule the timer again to give the CPU
+		 * a chance to handle the pending interrupt.
 		 */
 		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
 		if (irr & (1U << (vector % 32))) {
-			apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
+			pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
+			rearm = true;
 			continue;
 		}
 		free_moved_vector(apicd);
 	}

-	raw_spin_unlock(&vector_lock);
+	/*
+	 * Must happen under vector_lock to make the timer_pending() check
+	 * in __vector_schedule_cleanup() race free against the rearm here.
+	 */
+	if (rearm)
+		mod_timer(tmr, jiffies + 1);
+
+	raw_spin_unlock_irq(&vector_lock);
 }

-static void __send_cleanup_vector(struct apic_chip_data *apicd)
+static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
 {
-	unsigned int cpu;
+	unsigned int cpu = apicd->prev_cpu;

 	raw_spin_lock(&vector_lock);
 	apicd->move_in_progress = 0;
-	cpu = apicd->prev_cpu;
 	if (cpu_online(cpu)) {
-		hlist_add_head(&apicd->clist, per_cpu_ptr(&cleanup_list, cpu));
-		apic->send_IPI(cpu, IRQ_MOVE_CLEANUP_VECTOR);
+		struct vector_cleanup *cl = per_cpu_ptr(&vector_cleanup, cpu);
+
+		/*
+		 * The lockless timer_pending() check is safe here. If it
+		 * returns true, then the callback will observe this new
+		 * apic data in the hlist as everything is serialized by
+		 * vector lock.
+		 *
+		 * If it returns false then the timer is either not armed
+		 * or the other CPU executes the callback, which again
+		 * would be blocked on vector lock. Rearming it in the
+		 * latter case makes it fire for nothing.
+		 *
+		 * This is also safe against the callback rearming the timer
+		 * because that's serialized via vector lock too.
+		 */
+		if (!timer_pending(&cl->timer)) {
+			cl->timer.expires = jiffies + 1;
+			add_timer_on(&cl->timer, cpu);
+		}
 	} else {
 		apicd->prev_vector = 0;
 	}
 	raw_spin_unlock(&vector_lock);
 }

-void send_cleanup_vector(struct irq_cfg *cfg)
+void vector_schedule_cleanup(struct irq_cfg *cfg)
 {
 	struct apic_chip_data *apicd;

 	apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
 	if (apicd->move_in_progress)
-		__send_cleanup_vector(apicd);
+		__vector_schedule_cleanup(apicd);
 }

 void irq_complete_move(struct irq_cfg *cfg)
@@ -1007,7 +1046,7 @@ void irq_complete_move(struct irq_cfg *c
 	 * on the same CPU.
 	 */
 	if (apicd->cpu == smp_processor_id())
-		__send_cleanup_vector(apicd);
+		__vector_schedule_cleanup(apicd);
 }

 /*
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -131,7 +131,6 @@ static const __initconst struct idt_data
 	INTG(RESCHEDULE_VECTOR,			asm_sysvec_reschedule_ipi),
 	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
 	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
-	INTG(IRQ_MOVE_CLEANUP_VECTOR,		asm_sysvec_irq_move_cleanup),
 	INTG(REBOOT_VECTOR,			asm_sysvec_reboot),
 #endif

--- a/arch/x86/platform/uv/uv_irq.c
+++ b/arch/x86/platform/uv/uv_irq.c
@@ -58,7 +58,7 @@ uv_set_irq_affinity(struct irq_data *dat
 	ret = parent->chip->irq_set_affinity(parent, mask, force);
 	if (ret >= 0) {
 		uv_program_mmr(cfg, data->chip_data);
-		send_cleanup_vector(cfg);
+		vector_schedule_cleanup(cfg);
 	}

 	return ret;
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3639,7 +3639,7 @@ static int amd_ir_set_affinity(struct ir
 	 * at the new destination. So, time to cleanup the previous
 	 * vector allocation.
 	 */
-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);

 	return IRQ_SET_MASK_OK_DONE;
 }
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -51,7 +51,7 @@ static int hyperv_ir_set_affinity(struct
 	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
 		return ret;

-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);

 	return 0;
 }
@@ -257,7 +257,7 @@ static int hyperv_root_ir_set_affinity(s
 	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
 		return ret;

-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);

 	return 0;
 }
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1180,7 +1180,7 @@ intel_ir_set_affinity(struct irq_data *d
 	 * at the new destination. So, time to cleanup the previous
 	 * vector allocation.
 	 */
-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);

 	return IRQ_SET_MASK_OK_DONE;
 }
> The untested below should do the trick. Wants to be split in several
> patches, but you get the idea.

I will continue the work from what you posted.

Thanks a lot!
    Xin

> [ full draft patch quoted above snipped ]
> The untested below should do the trick. Wants to be split in several
> patches, but you get the idea.

Hi Thomas,

How do you want to split the patch?

To me it's better to keep the changes in one patch, thus the differences
are more obvious.

We need a second patch to do vector cleanup in lapic_offline() in case
the vector cleanup timer has not expired. But I don't see a good way to
split your draft patch (I only have a minor fix on it).

Thanks!
    Xin

> [ full draft patch quoted above snipped ]
On Mon, Jun 19 2023 at 08:00, Li, Xin3 wrote:
> To me it's better to keep the changes in one patch, thus the
> differences are more obvious.

The rename to vector_schedule_cleanup() can obviously be done first.

> We need a second patch to do vector cleanup in lapic_offline() in case
> the vector cleanup timer has not expired.

Right. I was lazy and just put a WARN_ON() there under the assumption
that you will figure it out.

But a second patch? We don't switch things over into a broken state
first and then fix it up afterwards.

Thanks,

        tglx
> > To me it's better to keep the changes in one patch, thus the
> > differences are more obvious.
>
> The rename to vector_schedule_cleanup() can obviously be done first.

Okay, it's a bit weird to me to rename before any actual code logic
change.

> > We need a second patch to do vector cleanup in lapic_offline() in
> > case the vector cleanup timer has not expired.
>
> Right. I was lazy and just put a WARN_ON() there under the assumption
> that you will figure it out.

I see that, as your changes to lapic_offline() are completely new.

> But a second patch?
>
> We don't switch things over into a broken state first and then fix it
> up afterwards.

Makes sense!
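A minimal sketch of that vector cleanup in lapic_offline(), assuming the
body of the timer callback is factored into a hypothetical helper
__vector_cleanup(cl, check_irr) -- both the helper name and the flag are
assumptions, not part of the posted draft:

    void lapic_offline(void)
    {
            struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);

            lock_vector_lock();

            /* In case the cleanup timer has not expired, clean up now. */
            __vector_cleanup(cl, false);    /* hypothetical helper */

            irq_matrix_offline(vector_matrix);
            WARN_ON_ONCE(try_to_del_timer_sync(&cl->timer) < 0);
            WARN_ON_ONCE(!hlist_empty(&cl->head));

            unlock_vector_lock();
    }

Skipping the IRR check is fine here because the CPU is going offline and
no further device interrupts can arrive on it, so the cleanup can run
unconditionally.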
On June 19, 2023 11:47:08 AM PDT, "Li, Xin3" <xin3.li@intel.com> wrote:
>>> To me it's better to keep the changes in one patch, thus the
>>> differences are more obvious.
>>
>> The rename to vector_schedule_cleanup() can obviously be done first.
>
> Okay, it's a bit weird to me to rename before any actual code logic
> change.

Weird or not, that's the established practice.

However, if you think about it, it makes sense: that way your code logic
patch doesn't contain a bunch of names which will almost immediately be
outdated. That is *really* confusing when you are going back through the
git history, for example.
> > Okay, it's a bit weird to me to rename before any actual code logic
> > change.
>
> Weird or not, that's the established practice.
>
> However, if you think about it, it makes sense: that way your code logic
> patch doesn't contain a bunch of names which will almost immediately be
> outdated. That is *really* confusing when you are going back through the
> git history, for example.

Thanks for the clarification!
    Xin
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..7e125fff45ab 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -248,6 +248,10 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
+#ifdef CONFIG_SMP
+	} else if (vector == IRQ_MOVE_CLEANUP_VECTOR) {
+		sysvec_irq_move_cleanup(regs);
+#endif
 	} else {
 		ack_APIC_irq();