Message ID | 20221110055347.7463-6-xin3.li@intel.com (mailing list archive)
---|---
State | New, archived
Series | x86/traps,VMX: implement software based NMI/IRQ dispatch for VMX NMI/IRQ reinjection
On Wed, Nov 09, 2022, Xin Li wrote:
> +#if IS_ENABLED(CONFIG_KVM_INTEL)
> +/*
> + * KVM VMX reinjects NMI/IRQ on its current stack, it's a sync

Please use a verb other than "reinject".  There is no event injection of any
kind, KVM is simply making a function call.  KVM already uses "inject" and
"reinject" for when KVM is literally injecting events into the guest.

The "kvm_vmx" part is also weird IMO.  The function is in x86's
traps/exceptions namespace, not the KVM VMX namespace.

Maybe exc_raise_nmi_or_irq()?

> + * call thus the values in the pt_regs structure are not used in
> + * executing NMI/IRQ handlers,

Won't this break stack traces to some extent?

> +static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 vector)
>  {
> -	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
> -
> -	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
> -	vmx_do_interrupt_nmi_irqoff(entry);
> +	kvm_before_interrupt(vcpu, vector == NMI_VECTOR ?
> +				   KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
> +	kvm_vmx_reinject_nmi_irq(vector);

This is where I strongly object to kvm_vmx_reinject_nmi_irq().  This looks
like KVM is reinjecting the event into the guest, which is all kinds of
confusing.

> 	kvm_after_interrupt(vcpu);
>  }
> > +#if IS_ENABLED(CONFIG_KVM_INTEL)
> > +/*
> > + * KVM VMX reinjects NMI/IRQ on its current stack, it's a sync
>
> Please use a verb other than "reinject".  There is no event injection of any
> kind, KVM is simply making a function call.  KVM already uses "inject" and
> "reinject" for when KVM is literally injecting events into the guest.
>
> The "kvm_vmx" part is also weird IMO.  The function is in x86's
> traps/exceptions namespace, not the KVM VMX namespace.

Right, "kvm_vmx" doesn't look good per your explanation.

> Maybe exc_raise_nmi_or_irq()?

It's good for me.

> > + * call thus the values in the pt_regs structure are not used in
> > + * executing NMI/IRQ handlers,
>
> Won't this break stack traces to some extent?

The pt_regs structure, and its IP/CS, is NOT part of the call stack, thus
I don't see a problem.  No?

> > +static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 vector)
> >  {
> > -	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
> > -
> > -	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
> > -	vmx_do_interrupt_nmi_irqoff(entry);
> > +	kvm_before_interrupt(vcpu, vector == NMI_VECTOR ?
> > +				   KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
> > +	kvm_vmx_reinject_nmi_irq(vector);
>
> This is where I strongly object to kvm_vmx_reinject_nmi_irq().  This looks
> like KVM is reinjecting the event into the guest, which is all kinds of
> confusing.
>
> > 	kvm_after_interrupt(vcpu);
> >  }
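For reference, a minimal sketch of how the traps.c helper from the patch
below might read under the name agreed above.  The body is identical to
kvm_vmx_reinject_nmi_irq() as posted; only the name follows the suggestion
in this exchange, and nothing here is merged kernel code:

/*
 * Sketch only: same body as kvm_vmx_reinject_nmi_irq() in the patch
 * below, carrying just the rename agreed in this thread.
 */
void exc_raise_nmi_or_irq(u32 vector)
{
	struct pt_regs irq_regs;

	memset(&irq_regs, 0, sizeof(irq_regs));

	if (vector == NMI_VECTOR)
		return exc_nmi(&irq_regs);

	external_interrupt(&irq_regs, vector);
}
EXPORT_SYMBOL_GPL(exc_raise_nmi_or_irq);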
On Thu, Nov 10, 2022, Li, Xin3 wrote:
> > > +#if IS_ENABLED(CONFIG_KVM_INTEL)
> > > +/*
> > > + * KVM VMX reinjects NMI/IRQ on its current stack, it's a sync
> >
> > Please use a verb other than "reinject".  There is no event injection of
> > any kind, KVM is simply making a function call.  KVM already uses "inject"
> > and "reinject" for when KVM is literally injecting events into the guest.
> >
> > The "kvm_vmx" part is also weird IMO.  The function is in x86's
> > traps/exceptions namespace, not the KVM VMX namespace.
>
> Right, "kvm_vmx" doesn't look good per your explanation.
>
> > Maybe exc_raise_nmi_or_irq()?
>
> It's good for me.
>
> > > + * call thus the values in the pt_regs structure are not used in
> > > + * executing NMI/IRQ handlers,
> >
> > Won't this break stack traces to some extent?
>
> The pt_regs structure, and its IP/CS, is NOT part of the call stack, thus
> I don't see a problem.  No?

bool nmi_cpu_backtrace(struct pt_regs *regs)
{
	int cpu = smp_processor_id();
	unsigned long flags;

	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
		/*
		 * Allow nested NMI backtraces while serializing
		 * against other CPUs.
		 */
		printk_cpu_sync_get_irqsave(flags);
		if (!READ_ONCE(backtrace_idle) && regs &&
		    cpu_in_idle(instruction_pointer(regs))) {
			pr_warn("NMI backtrace for cpu %d skipped: idling at %pS\n",
				cpu, (void *)instruction_pointer(regs));
		} else {
			pr_warn("NMI backtrace for cpu %d\n", cpu);
			if (regs)
				show_regs(regs);	<============================== HERE!!!
			else
				dump_stack();
		}
		printk_cpu_sync_put_irqrestore(flags);
		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
		return true;
	}

	return false;
}
On Thu, Nov 10, 2022 at 08:53:09PM +0000, Sean Christopherson wrote:
> On Thu, Nov 10, 2022, Li, Xin3 wrote:
> > > > + * call thus the values in the pt_regs structure are not used in
> > > > + * executing NMI/IRQ handlers,
> > >
> > > Won't this break stack traces to some extent?
> >
> > The pt_regs structure, and its IP/CS, is NOT part of the call stack, thus
> > I don't see a problem.  No?

I'm not sure what Xin3 is trying to say, but NMI/IRQ handlers use pt_regs
a *LOT*.  pt_regs *MUST* be correct.
On November 11, 2022 1:19:23 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Nov 10, 2022 at 08:53:09PM +0000, Sean Christopherson wrote:
> > On Thu, Nov 10, 2022, Li, Xin3 wrote:
> > > > > + * call thus the values in the pt_regs structure are not used in
> > > > > + * executing NMI/IRQ handlers,
> > > >
> > > > Won't this break stack traces to some extent?
> > >
> > > The pt_regs structure, and its IP/CS, is NOT part of the call stack, thus
> > > I don't see a problem.  No?
>
> I'm not sure what Xin3 is trying to say, but NMI/IRQ handlers use pt_regs
> a *LOT*.  pt_regs *MUST* be correct.

What is "correct" in this context?  Could you describe what aspects of the
register image you rely on, and what you expect them to be?

Currently KVM basically stuffs random data into pt_regs; this at least makes
it explicitly zero.
On Fri, Nov 11, 2022 at 01:29:35AM -0800, H. Peter Anvin wrote:
> On November 11, 2022 1:19:23 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Nov 10, 2022 at 08:53:09PM +0000, Sean Christopherson wrote:
> > > On Thu, Nov 10, 2022, Li, Xin3 wrote:
> > > > > > + * call thus the values in the pt_regs structure are not used in
> > > > > > + * executing NMI/IRQ handlers,
> > > > >
> > > > > Won't this break stack traces to some extent?
> > > >
> > > > The pt_regs structure, and its IP/CS, is NOT part of the call stack, thus
> > > > I don't see a problem.  No?
> >
> > I'm not sure what Xin3 is trying to say, but NMI/IRQ handlers use pt_regs
> > a *LOT*.  pt_regs *MUST* be correct.
>
> What is "correct" in this context?

I don't know since I don't really speak virt, but I could imagine the regset
that would match the vmrun (or whatever Intel decided to call that again)
instruction.

> Could you describe what aspects of the register image you rely on, and what
> you expect them to be?

We rely on CS, IP, FLAGS, SS and SP to be coherent and usable at the very
least (must be able to start an unwind from it).  But things like perf (NMI)
access *all* of them and possibly copy them out to userspace.  Perf can also
try and use the segment registers in order to try and establish a linear
address.

Some exceptions (#GP) access whatever is needed to fully decode and emulate
the instruction (IOPL, UMIP, etc..) including the segment registers.

> Currently KVM basically stuffs random data into pt_regs; this at least
> makes it explicitly zero.

:-(  Both are broken.  Once again proving to me that virt is a bunch of
duct-tape at best.
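To make those requirements concrete, a minimal sketch (not merged kernel
code) of one way to satisfy them: describe the host call site instead of
zeroing everything, so an unwind can at least start.  fill_irq_regs() is a
hypothetical helper name; the pt_regs fields and the constants used are the
real x86 ones:

#include <linux/string.h>
#include <asm/ptrace.h>
#include <asm/segment.h>
#include <asm/processor-flags.h>

/*
 * Hypothetical sketch: populate a synthetic pt_regs with just enough
 * host state (IP/SP/CS/SS/FLAGS) that an unwind can start from it,
 * rather than handing the handlers an all-zero register image.
 */
static inline void fill_irq_regs(struct pt_regs *regs)
{
	memset(regs, 0, sizeof(*regs));

	/* Point at the caller; close enough to start an unwind. */
	regs->ip    = (unsigned long)__builtin_return_address(0);
	regs->sp    = (unsigned long)__builtin_frame_address(0);
	regs->cs    = __KERNEL_CS;
	regs->ss    = __KERNEL_DS;
	regs->flags = X86_EFLAGS_FIXED;	/* IF=0 matches the irqoff context */
}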
On 11/11/22 11:45, Peter Zijlstra wrote:
>> What is "correct" in this context?
>
> I don't know since I don't really speak virt, but I could imagine the
> regset that would match the vmrun (or whatever Intel decided to call
> that again) instruction.

Right now it is not exactly that, but close.  The RIP is somewhere in
vmx_do_interrupt_nmi_irqoff; CS/SS are correct (i.e. it's not like they
point to guest values!) and other registers including RSP and RFLAGS are
consistent with the RIP.

>> Currently KVM basically stuffs random data into pt_regs; this at least
>> makes it explicitly zero.
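A sketch of why that register state hangs together: the vmenter.S code the
patch below removes builds a hardware-style IRET frame on the host stack
before calling the handler.  The struct is illustrative only, matching the
push sequence in that assembly; the real frame is built by hand, not
declared as a type:

/*
 * Memory layout (low to high address) of the synthetic frame that
 * vmx_do_interrupt_nmi_irqoff() pushes on x86-64, mirroring what the
 * CPU itself would push when delivering an interrupt.
 */
struct synthetic_iret_frame {
	unsigned long rip;	/* return address into vmx_do_interrupt_nmi_irqoff */
	unsigned long cs;	/* __KERNEL_CS */
	unsigned long rflags;	/* from pushf, so IF=0 in this context */
	unsigned long rsp;	/* saved RBP, i.e. the original host stack pointer */
	unsigned long ss;	/* __KERNEL_DS */
};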
On Fri, Nov 11, 2022 at 12:57:58PM +0100, Paolo Bonzini wrote:
> On 11/11/22 11:45, Peter Zijlstra wrote:
> >> What is "correct" in this context?
> >
> > I don't know since I don't really speak virt, but I could imagine the
> > regset that would match the vmrun (or whatever Intel decided to call
> > that again) instruction.
>
> Right now it is not exactly that, but close.  The RIP is somewhere in
> vmx_do_interrupt_nmi_irqoff; CS/SS are correct (i.e. it's not like they
> point to guest values!) and other registers including RSP and RFLAGS are
> consistent with the RIP.

*phew*, that sounds a *lot* better than 'random'.  And yes, that should do.

Another thing; these patches appear to be about system vectors and
everything, but what I understand from Andrew is that VMX is only screwy
vs NMI, not regular interrupts/exceptions, so where does that come from?

SVM specifically fixed the NMI wonkiness with their Global Interrupt Flag
thingy.
On 11/11/22 13:10, Peter Zijlstra wrote:
> *phew*, that sounds a *lot* better than 'random'.  And yes, that should
> do.
>
> Another thing; these patches appear to be about system vectors and
> everything, but what I understand from Andrew is that VMX is only screwy
> vs NMI, not regular interrupts/exceptions, so where does that come from?

Exceptions are fine; for interrupts it's optional in theory, but in practice
you have to invoke them manually just like NMIs (I had replied on this in
the other thread).

Paolo

> SVM specifically fixed the NMI wonkiness with their Global Interrupt
> Flag thingy.
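The mechanism behind that point, sketched under the assumption that KVM runs
with the "acknowledge interrupt on exit" VM-exit control set.
VM_EXIT_ACK_INTR_ON_EXIT and INTR_INFO_VECTOR_MASK are the real VMCS
constants from asm/vmx.h; the helper itself is hypothetical:

#include <asm/vmx.h>	/* VM_EXIT_ACK_INTR_ON_EXIT, INTR_INFO_VECTOR_MASK */

/*
 * With VM_EXIT_ACK_INTR_ON_EXIT set, an external interrupt arriving
 * while the guest runs causes a VM exit in which the CPU acknowledges
 * the interrupt and stores its vector in the VM_EXIT_INTR_INFO field,
 * instead of delivering it through the host IDT.  Hardware never
 * invokes the host handler, so KVM has to call it manually, which is
 * what handle_external_interrupt_irqoff() in the patch below does.
 */
static u32 acked_host_vector(u32 exit_intr_info)
{
	return exit_intr_info & INTR_INFO_VECTOR_MASK;
}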
> > > > + * call thus the values in the pt_regs structure are not used in
> > > > + * executing NMI/IRQ handlers,
> > >
> > > Won't this break stack traces to some extent?
> >
> > The pt_regs structure, and its IP/CS, is NOT part of the call stack,
> > thus I don't see a problem.  No?
>
> bool nmi_cpu_backtrace(struct pt_regs *regs)
> {
> 	int cpu = smp_processor_id();
> 	unsigned long flags;
>
> 	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> 		/*
> 		 * Allow nested NMI backtraces while serializing
> 		 * against other CPUs.
> 		 */
> 		printk_cpu_sync_get_irqsave(flags);
> 		if (!READ_ONCE(backtrace_idle) && regs &&
> 		    cpu_in_idle(instruction_pointer(regs))) {
> 			pr_warn("NMI backtrace for cpu %d skipped: idling at %pS\n",
> 				cpu, (void *)instruction_pointer(regs));
> 		} else {
> 			pr_warn("NMI backtrace for cpu %d\n", cpu);
> 			if (regs)
> 				show_regs(regs);	<============================== HERE!!!
> 			else
> 				dump_stack();
> 		}
> 		printk_cpu_sync_put_irqrestore(flags);
> 		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
> 		return true;
> 	}
>
> 	return false;
> }

Right, this is an example in which pt_regs usage gets broken by my patch.
However, a bigger problem emerges: how should NMI handlers be called while
VMX is running?  If we can address that, pt_regs usage will probably be fine.
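Some background on why that question is hard, sketched from the VMX rules as
the SDM describes them.  PIN_BASED_NMI_EXITING is the real VMCS constant
from asm/vmx.h; the helper is hypothetical:

#include <linux/types.h>
#include <asm/vmx.h>	/* PIN_BASED_NMI_EXITING */

/*
 * With PIN_BASED_NMI_EXITING set, an NMI arriving in VMX non-root mode
 * causes a VM exit instead of going through the host IDT, and the CPU
 * leaves NMIs blocked afterwards as though the NMI had been delivered.
 * KVM therefore has to invoke the host NMI handler by hand right after
 * the exit, and the blocking only ends when an IRET executes -- which
 * is exactly why *how* the handler is called (via a synthetic IRET
 * frame vs. a plain C function call) matters in this series.
 */
static bool host_nmi_needs_manual_dispatch(u32 pin_based_exec_ctrl)
{
	return pin_based_exec_ctrl & PIN_BASED_NMI_EXITING;
}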
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 89c4233e19db..4c56a8d31762 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -57,4 +57,6 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 					   unsigned long vector __maybe_unused)
 typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
 
+void kvm_vmx_reinject_nmi_irq(u32 vector);
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c1eb3bd335ce..9abf91534b13 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1528,6 +1528,29 @@ __visible noinstr void external_interrupt(struct pt_regs *regs,
 	common_interrupt(regs, vector);
 }
 
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+/*
+ * KVM VMX reinjects NMI/IRQ on its current stack, it's a sync
+ * call thus the values in the pt_regs structure are not used in
+ * executing NMI/IRQ handlers, except cs.RPL and flags.IF, which
+ * are both always 0 in the VMX NMI/IRQ reinjection context. Thus
+ * we simply allocate a zeroed pt_regs structure on current stack
+ * to call external_interrupt().
+ */
+void kvm_vmx_reinject_nmi_irq(u32 vector)
+{
+	struct pt_regs irq_regs;
+
+	memset(&irq_regs, 0, sizeof(irq_regs));
+
+	if (vector == NMI_VECTOR)
+		return exc_nmi(&irq_regs);
+
+	external_interrupt(&irq_regs, vector);
+}
+EXPORT_SYMBOL_GPL(kvm_vmx_reinject_nmi_irq);
+#endif
+
 void __init trap_init(void)
 {
 	/* Init cpu_entry_area before IST entries are set up */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8477d8bdd69c..0c1608b329cd 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -317,36 +317,3 @@ SYM_FUNC_START(vmread_error_trampoline)
 
 	RET
 SYM_FUNC_END(vmread_error_trampoline)
-
-SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
-	/*
-	 * Unconditionally create a stack frame, getting the correct RSP on the
-	 * stack (for x86-64) would take two instructions anyways, and RBP can
-	 * be used to restore RSP to make objtool happy (see below).
-	 */
-	push %_ASM_BP
-	mov %_ASM_SP, %_ASM_BP
-
-#ifdef CONFIG_X86_64
-	/*
-	 * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
-	 * creating the synthetic interrupt stack frame for the IRQ/NMI.
-	 */
-	and  $-16, %rsp
-	push $__KERNEL_DS
-	push %rbp
-#endif
-	pushf
-	push $__KERNEL_CS
-	CALL_NOSPEC _ASM_ARG1
-
-	/*
-	 * "Restore" RSP from RBP, even though IRET has already unwound RSP to
-	 * the correct value.  objtool doesn't know the callee will IRET and,
-	 * without the explicit restore, thinks the stack is getting walloped.
-	 * Using an unwind hint is problematic due to x86-64's dynamic alignment.
-	 */
-	mov %_ASM_BP, %_ASM_SP
-	pop %_ASM_BP
-	RET
-SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 63247c57c72c..b457e4888468 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -46,6 +46,7 @@
 #include <asm/mshyperv.h>
 #include <asm/mwait.h>
 #include <asm/spec-ctrl.h>
+#include <asm/traps.h>
 #include <asm/virtext.h>
 #include <asm/vmx.h>
 
@@ -6758,15 +6759,11 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
 }
 
-void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
-					unsigned long entry)
+static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 vector)
 {
-	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
-
-	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
-	vmx_do_interrupt_nmi_irqoff(entry);
+	kvm_before_interrupt(vcpu, vector == NMI_VECTOR ?
+				   KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
+	kvm_vmx_reinject_nmi_irq(vector);
 	kvm_after_interrupt(vcpu);
 }
 
@@ -6792,7 +6789,6 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 
 static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
 {
-	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
 	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
 
 	/* if exit due to PF check for async PF */
@@ -6806,20 +6802,19 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
 	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
+		handle_interrupt_nmi_irqoff(&vmx->vcpu, NMI_VECTOR);
 }
 
 static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 {
 	u32 intr_info = vmx_get_intr_info(vcpu);
 	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-	gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
 	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
 	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;
 
-	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+	handle_interrupt_nmi_irqoff(vcpu, vector);
 	vcpu->arch.at_instruction_boundary = true;
 }