Message ID | 20210913200132.3396598-6-sohil.mehta@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86 User Interrupts support | expand |
On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote: > A user interrupt notification vector is used on the receiver's cpu to > identify an interrupt as a user interrupt (and not a kernel interrupt). > Hardware uses the same notification vector to generate an IPI from a > sender's cpu core when the SENDUIPI instruction is executed. > > Typically, the kernel shouldn't receive an interrupt with this vector. > However, it is possible that the kernel might receive this vector. > > Scenario that can cause the spurious interrupt: > > Step cpu 0 (receiver task) cpu 1 (sender task) > ---- --------------------- ------------------- > 1 task is running > 2 executes SENDUIPI > 3 IPI sent > 4 context switched out > 5 IPI delivered > (kernel interrupt detected) > > A kernel interrupt can be detected, if a receiver task gets scheduled > out after the SENDUIPI-based IPI was sent but before the IPI was > delivered. What happens if the SENDUIPI is issued when the target task is not on the CPU? How is that any different from the above? > The kernel doesn't need to do anything in this case other than receiving > the interrupt and clearing the local APIC. The user interrupt is always > stored in the receiver's UPID before the IPI is generated. When the > receiver gets scheduled back the interrupt would be delivered based on > its UPID. So why on earth is that vector reaching the CPU at all? > +#ifdef CONFIG_X86_USER_INTERRUPTS > + seq_printf(p, "%*s: ", prec, "UIS"); No point in printing that when user interrupts are not available/enabled on the system. > + for_each_online_cpu(j) > + seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count); > + seq_puts(p, " User-interrupt spurious event\n"); > #endif > return 0; > } > @@ -325,6 +331,33 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi) > } > #endif > > +#ifdef CONFIG_X86_USER_INTERRUPTS > +/* > + * Handler for UINTR_NOTIFICATION_VECTOR. > + * > + * The notification vector is used by the cpu to detect a User Interrupt. In > + * the typical usage, the cpu would handle this interrupt and clear the local > + * apic. > + * > + * However, it is possible that the kernel might receive this vector. This can > + * happen if the receiver thread was running when the interrupt was sent but it > + * got scheduled out before the interrupt was delivered. The kernel doesn't > + * need to do anything other than clearing the local APIC. A pending user > + * interrupt is always saved in the receiver's UPID which can be referenced > + * when the receiver gets scheduled back. > + * > + * If the kernel receives a storm of these, it could mean an issue with the > + * kernel's saving and restoring of the User Interrupt MSR state; Specifically, > + * the notification vector bits in the IA32_UINTR_MISC_MSR. Definitely well thought out hardware that. Thanks, tglx
On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote: > On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote: >> The kernel doesn't need to do anything in this case other than receiving >> the interrupt and clearing the local APIC. The user interrupt is always >> stored in the receiver's UPID before the IPI is generated. When the >> receiver gets scheduled back the interrupt would be delivered based on >> its UPID. > > So why on earth is that vector reaching the CPU at all? Let's see how this works: task starts using UINTR. set UINTR_NOTIFACTION_VECTOR in MSR_IA32_UINTR_MISC So from that point on the User-Interrupt Notification Identification mechanism swallows the vector. Where this stops working is not limited to context switch. The wreckage comes from XSAVES: "After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32]; XSAVES does not modify the remainder of that MSR.)" So the problem is _not_ context switch. The problem is XSAVES and that can be issued even without a context switch. The obvious question is: What is the value of clearing UINV? Absolutely none. That notification vector cannot be used for anything else, so why would the OS be interested to see it ever? This is about user space interupts, right? UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should be considered a hardware bug. Thanks, tglx
On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote: > On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote: > The obvious question is: What is the value of clearing UINV? > > Absolutely none. That notification vector cannot be used for anything > else, so why would the OS be interested to see it ever? This is about > user space interupts, right? > > UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched > by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should > be considered a hardware bug. After decoding the documentation (sigh) and staring at the implications of keeping UINV armed, I can see the point vs. the UPID lifetime issue when a task gets scheduled out and migrated to a different CPU. Not the most pretty solution, but as there needs to be some invalidation which needs to be undone on return to user space it probably does not matter much. As the whole thing is tightly coupled to XSAVES/RSTORS we need to integrate it into that machinery and not pretend that it's something half independent. That means we have to handle the setting of the SN bit in UPID whenever XSTATE is saved either during context switch, when the kernel uses the FPU or in other places (signals, fpu_clone ...). They all end up in save_fpregs_to_fpstate() so that might be the place to look at. While talking about that: fpu_clone() has to invalidate the UINTR state in the clone's xstate after the memcpy() or xsaves() operation. Also the restore portion on the way back to user space has to be coupled more tightly: arch_exit_to_user_mode_prepare() { ... if (unlikely(ti_work & _TIF_UPID)) uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD); if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) switch_fpu_return(); } upid_set_ndst(upid) { apicid = __this_cpu_read(x86_cpu_to_apicid); if (x2apic_enabled()) upid->ndst.x2apic = apicid; else upid->ndst.apic = apicid; } uintr_restore_upid(bool xrstors_pending) { clear_thread_flag(TIF_UPID); // Update destination upid_set_ndst(upid); // Do we need something stronger here? barrier(); clear_bit(SN, upid->status); // Any SENDUIPI after this point sends to this CPU // Any bit which was set in upid->pir after SN was set // and/or UINV was cleared by XSAVES up to the point // where SN was cleared above is not reflected in UIRR. // As this runs with interrupts disabled the current state // of upid->pir can be read and used for restore. A SENDUIPI // which sets a bit in upid->pir after that read will send // the notification vector which is going to be handled once // the task reenables interrupts on return to user space. // If the SENDUIPI set the bit before the read then the // notification vector handling will just observe the same // PIR state. // Needs to be a locked access as there might be a // concurrent SENDUIPI modiying it. pir = read_locked(upid->pir); if (xrstors_pending)) { // Update the saved xstate for xrstors current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR; current->xstate.uintr.uirr = pir; } else { // Manually restore UIRR and UINV wrmsrl(IA32_UINTR_RR, pir); misc.val64 = 0; misc.uittsz = current->uintr->uittsz; misc.uinv = UINTR_NOTIFICATION_VECTOR; wrmsrl(IA32_UINTR_MISC, misc.val64); } } That's how I deciphered the documentation and I don't think this is far from reality, but I might be wrong as usual. Hmm? Thanks, tglx
On 9/26/2021 5:39 AM, Thomas Gleixner wrote: > On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote: >> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote: >> The obvious question is: What is the value of clearing UINV? >> >> Absolutely none. That notification vector cannot be used for anything >> else, so why would the OS be interested to see it ever? This is about >> user space interupts, right? >> >> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched >> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should >> be considered a hardware bug. > After decoding the documentation (sigh) and staring at the implications of > keeping UINV armed, I can see the point vs. the UPID lifetime issue when > a task gets scheduled out and migrated to a different CPU. I think you got it right. Here is my understanding of this. The User-interrupt notification processing moves all the pending interrupts from UPID.PIR to the UIRR. As you mentioned below, XSTATE is saved due to several reasons which saves the UIRR into memory. UIRR should no longer be updated after it has been saved. XSAVES clears UINV is to stop detecting additional interrupts in the UIRR after it has been saved. > Not the most pretty solution, but as there needs to be some invalidation > which needs to be undone on return to user space it probably does not > matter much. > > As the whole thing is tightly coupled to XSAVES/RSTORS we need to > integrate it into that machinery and not pretend that it's something > half independent. I agree. Thank you for pointing this out. > That means we have to handle the setting of the SN bit in UPID whenever > XSTATE is saved either during context switch, when the kernel uses the > FPU or in other places (signals, fpu_clone ...). They all end up in > save_fpregs_to_fpstate() so that might be the place to look at. Yes. The current code doesn't do this. SN bit should be set whenever UINTR XSTATE is saved. > While talking about that: fpu_clone() has to invalidate the UINTR state > in the clone's xstate after the memcpy() or xsaves() operation. > > Also the restore portion on the way back to user space has to be coupled > more tightly: > > arch_exit_to_user_mode_prepare() > { > ... > if (unlikely(ti_work & _TIF_UPID)) > uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD); > if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) > switch_fpu_return(); > } I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is saved. > upid_set_ndst(upid) > { > apicid = __this_cpu_read(x86_cpu_to_apicid); > > if (x2apic_enabled()) > upid->ndst.x2apic = apicid; > else > upid->ndst.apic = apicid; > } > > uintr_restore_upid(bool xrstors_pending) > { > clear_thread_flag(TIF_UPID); > > // Update destination > upid_set_ndst(upid); > > // Do we need something stronger here? > barrier(); > > clear_bit(SN, upid->status); > > // Any SENDUIPI after this point sends to this CPU > > // Any bit which was set in upid->pir after SN was set > // and/or UINV was cleared by XSAVES up to the point > // where SN was cleared above is not reflected in UIRR. > > // As this runs with interrupts disabled the current state > // of upid->pir can be read and used for restore. A SENDUIPI > // which sets a bit in upid->pir after that read will send > // the notification vector which is going to be handled once > // the task reenables interrupts on return to user space. > // If the SENDUIPI set the bit before the read then the > // notification vector handling will just observe the same > // PIR state. > > // Needs to be a locked access as there might be a > // concurrent SENDUIPI modiying it. > pir = read_locked(upid->pir); > > if (xrstors_pending)) { > // Update the saved xstate for xrstors > current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR; XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we need this again. Is it because it could have been overwritten by calling XSAVES twice? > current->xstate.uintr.uirr = pir; I believe PIR should be ORed. There could be some bits already set in the UIRR. Also, shouldn't UPID->PIR be cleared? If not, we would detect these interrupts all over again during the next ring transition. > } else { > // Manually restore UIRR and UINV > wrmsrl(IA32_UINTR_RR, pir); I believe read-modify-write here as well. > misc.val64 = 0; > misc.uittsz = current->uintr->uittsz; > misc.uinv = UINTR_NOTIFICATION_VECTOR; > wrmsrl(IA32_UINTR_MISC, misc.val64); Thanks! This helps reduce the additional MSR read. > } > } > > That's how I deciphered the documentation and I don't think this is far > from reality, but I might be wrong as usual. > > Hmm? Thank you for the simplification. This is very helpful. Sohil
On 9/23/2021 4:07 PM, Thomas Gleixner wrote: > On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote: >> A user interrupt notification vector is used on the receiver's cpu to >> identify an interrupt as a user interrupt (and not a kernel interrupt). >> Hardware uses the same notification vector to generate an IPI from a >> sender's cpu core when the SENDUIPI instruction is executed. >> >> Typically, the kernel shouldn't receive an interrupt with this vector. >> However, it is possible that the kernel might receive this vector. >> >> Scenario that can cause the spurious interrupt: >> >> Step cpu 0 (receiver task) cpu 1 (sender task) >> ---- --------------------- ------------------- >> 1 task is running >> 2 executes SENDUIPI >> 3 IPI sent >> 4 context switched out >> 5 IPI delivered >> (kernel interrupt detected) >> >> A kernel interrupt can be detected, if a receiver task gets scheduled >> out after the SENDUIPI-based IPI was sent but before the IPI was >> delivered. > What happens if the SENDUIPI is issued when the target task is not on > the CPU? How is that any different from the above? This didn't get covered in the other thread. Thought, I would clarify this a bit more. A notification IPI is sent from the CPU that executes SENDUIPI if the target task is running (SN is 0). If the target task is not running SN bit in the UPID is set, which prevents any notification interrupts from being generated. However, it is possible that SN is 0 when SENDUIPI was executed which generates the notification IPI. But when the IPI arrives on receiver CPU, SN has been set, the task state has been saved and UINV has been cleared. A kernel interrupt is detected in this case. I have a sample that demos this. I'll fix the current code and then send out the results. >> The kernel doesn't need to do anything in this case other than receiving >> the interrupt and clearing the local APIC. The user interrupt is always >> stored in the receiver's UPID before the IPI is generated. When the >> receiver gets scheduled back the interrupt would be delivered based on >> its UPID. > So why on earth is that vector reaching the CPU at all? You covered this in the other thread. >> +#ifdef CONFIG_X86_USER_INTERRUPTS >> + seq_printf(p, "%*s: ", prec, "UIS"); > No point in printing that when user interrupts are not available/enabled > on the system. > Will fix this. Thanks, Sohil
Sohil, On Mon, Sep 27 2021 at 12:07, Sohil Mehta wrote: > On 9/26/2021 5:39 AM, Thomas Gleixner wrote: > > The User-interrupt notification processing moves all the pending > interrupts from UPID.PIR to the UIRR. Indeed that makes sense. Should have thought about that myself. >> Also the restore portion on the way back to user space has to be coupled >> more tightly: >> >> arch_exit_to_user_mode_prepare() >> { >> ... >> if (unlikely(ti_work & _TIF_UPID)) >> uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD); >> if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) >> switch_fpu_return(); >> } > > I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is > saved. Yes. >> upid_set_ndst(upid) >> { >> apicid = __this_cpu_read(x86_cpu_to_apicid); >> >> if (x2apic_enabled()) >> upid->ndst.x2apic = apicid; >> else >> upid->ndst.apic = apicid; >> } >> >> uintr_restore_upid(bool xrstors_pending) >> { >> clear_thread_flag(TIF_UPID); >> >> // Update destination >> upid_set_ndst(upid); >> >> // Do we need something stronger here? >> barrier(); >> >> clear_bit(SN, upid->status); >> >> // Any SENDUIPI after this point sends to this CPU >> >> // Any bit which was set in upid->pir after SN was set >> // and/or UINV was cleared by XSAVES up to the point >> // where SN was cleared above is not reflected in UIRR. >> >> // As this runs with interrupts disabled the current state >> // of upid->pir can be read and used for restore. A SENDUIPI >> // which sets a bit in upid->pir after that read will send >> // the notification vector which is going to be handled once >> // the task reenables interrupts on return to user space. >> // If the SENDUIPI set the bit before the read then the >> // notification vector handling will just observe the same >> // PIR state. >> >> // Needs to be a locked access as there might be a >> // concurrent SENDUIPI modiying it. >> pir = read_locked(upid->pir); >> >> if (xrstors_pending)) { >> // Update the saved xstate for xrstors >> current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR; > > XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we > need this again. Is it because it could have been overwritten by calling > XSAVES twice? Yes that can happen AFAICT. I haven't done a deep analysis, but this needs to looked at. >> current->xstate.uintr.uirr = pir; > > I believe PIR should be ORed. There could be some bits already set in > the UIRR. > > Also, shouldn't UPID->PIR be cleared? If not, we would detect these > interrupts all over again during the next ring transition. Right. So that PIR read above needs to be a locked cmpxchg(). >> } else { >> // Manually restore UIRR and UINV >> wrmsrl(IA32_UINTR_RR, pir); > I believe read-modify-write here as well. Sigh, yes. Thanks, tglx
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index 275e7fd20310..279afc01f1ac 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -19,6 +19,9 @@ typedef struct { unsigned int kvm_posted_intr_ipis; unsigned int kvm_posted_intr_wakeup_ipis; unsigned int kvm_posted_intr_nested_ipis; +#endif +#ifdef CONFIG_X86_USER_INTERRUPTS + unsigned int uintr_spurious_count; #endif unsigned int x86_platform_ipis; /* arch dependent */ unsigned int apic_perf_irqs; diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 1345088e9902..5929a6f9eeee 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -671,6 +671,10 @@ DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR, sysvec_kvm_posted_intr_wakeup DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR, sysvec_kvm_posted_intr_nested_ipi); #endif +#ifdef CONFIG_X86_USER_INTERRUPTS +DECLARE_IDTENTRY_SYSVEC(UINTR_NOTIFICATION_VECTOR, sysvec_uintr_spurious_interrupt); +#endif + #if IS_ENABLED(CONFIG_HYPERV) DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_hyperv_callback); DECLARE_IDTENTRY_SYSVEC(HYPERV_REENLIGHTENMENT_VECTOR, sysvec_hyperv_reenlightenment); diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 43dcb9284208..d26faa504931 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -104,7 +104,10 @@ #define HYPERV_STIMER0_VECTOR 0xed #endif -#define LOCAL_TIMER_VECTOR 0xec +/* Vector for User interrupt notifications */ +#define UINTR_NOTIFICATION_VECTOR 0xec + +#define LOCAL_TIMER_VECTOR 0xeb #define NR_VECTORS 256 diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index df0fa695bb09..d8c45e0728f0 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -147,6 +147,9 @@ static const __initconst struct idt_data apic_idts[] = { INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi), INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi), # endif +#ifdef CONFIG_X86_USER_INTERRUPTS + INTG(UINTR_NOTIFICATION_VECTOR, asm_sysvec_uintr_spurious_interrupt), +#endif # ifdef CONFIG_IRQ_WORK INTG(IRQ_WORK_VECTOR, asm_sysvec_irq_work), # endif diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index e28f6a5d14f1..e3c35668c7c5 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -181,6 +181,12 @@ int arch_show_interrupts(struct seq_file *p, int prec) seq_printf(p, "%10u ", irq_stats(j)->kvm_posted_intr_wakeup_ipis); seq_puts(p, " Posted-interrupt wakeup event\n"); +#endif +#ifdef CONFIG_X86_USER_INTERRUPTS + seq_printf(p, "%*s: ", prec, "UIS"); + for_each_online_cpu(j) + seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count); + seq_puts(p, " User-interrupt spurious event\n"); #endif return 0; } @@ -325,6 +331,33 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi) } #endif +#ifdef CONFIG_X86_USER_INTERRUPTS +/* + * Handler for UINTR_NOTIFICATION_VECTOR. + * + * The notification vector is used by the cpu to detect a User Interrupt. In + * the typical usage, the cpu would handle this interrupt and clear the local + * apic. + * + * However, it is possible that the kernel might receive this vector. This can + * happen if the receiver thread was running when the interrupt was sent but it + * got scheduled out before the interrupt was delivered. The kernel doesn't + * need to do anything other than clearing the local APIC. A pending user + * interrupt is always saved in the receiver's UPID which can be referenced + * when the receiver gets scheduled back. + * + * If the kernel receives a storm of these, it could mean an issue with the + * kernel's saving and restoring of the User Interrupt MSR state; Specifically, + * the notification vector bits in the IA32_UINTR_MISC_MSR. + */ +DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_uintr_spurious_interrupt) +{ + /* TODO: Add entry-exit tracepoints */ + ack_APIC_irq(); + inc_irq_stat(uintr_spurious_count); +} +#endif + #ifdef CONFIG_HOTPLUG_CPU /* A cpu has been removed from cpu_online_mask. Reset irq affinities. */