| Message ID | 20200915191505.10355-3-sean.j.christopherson@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: VMX: Clean up IRQ/NMI handling |
Add CC: Andy Lutomirski
Add CC: Steven Rostedt

I think this patch made it wrong for NMI.

On Wed, Sep 16, 2020 at 3:27 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> Rework NMI VM-Exit handling to invoke the kernel handler by function
> call instead of INTn.  INTn microcode is relatively expensive, and
> aligning the IRQ and NMI handling will make it easier to update KVM
> should some newfangled method for invoking the handlers come along.
>
> Suggested-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 391f079d9136..b0eca151931d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6411,40 +6411,40 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>
>  void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
>
> +static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> +{
> +	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> +	gate_desc *desc = (gate_desc *)host_idt_base + vector;
> +
> +	kvm_before_interrupt(vcpu);
> +	vmx_do_interrupt_nmi_irqoff(gate_offset(desc));
> +	kvm_after_interrupt(vcpu);
> +}
> +
>  static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
>  {
>  	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
>
>  	/* if exit due to PF check for async PF */
> -	if (is_page_fault(intr_info)) {
> +	if (is_page_fault(intr_info))
>  		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
>  	/* Handle machine checks before interrupts are enabled */
> -	} else if (is_machine_check(intr_info)) {
> +	else if (is_machine_check(intr_info))
>  		kvm_machine_check();
>  	/* We need to handle NMIs before interrupts are enabled */
> -	} else if (is_nmi(intr_info)) {
> -		kvm_before_interrupt(&vmx->vcpu);
> -		asm("int $2");
> -		kvm_after_interrupt(&vmx->vcpu);
> -	}
> +	else if (is_nmi(intr_info))
> +		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
>  }

When handle_interrupt_nmi_irqoff() is called, we may lose the
CPU-hidden-NMI-masked state due to the IRET of #DB, #BP or other traps
between VMEXIT and handle_interrupt_nmi_irqoff().

But the NMI handler in the Linux kernel *expects* the CPU-hidden-NMI-masked
state to still be set in the CPU so that no nested NMI intrudes into the
beginning of the handler.

The original code "int $2" can provide the needed CPU-hidden-NMI-masked
state when entering #NMI, but I doubt this change can.

I may have missed something, especially since I haven't read all of the
earlier discussions about the change. More importantly, I haven't found
the original suggestion from Andi Kleen:

(Quote from the cover letter):
	The NMI consolidation was loosely suggested by Andi Kleen.  Andi's
	actual suggestion was to export and directly call the NMI handler,
	but that's a more involved change (unless I'm misunderstanding the
	wants of the NMI handler), whereas piggybacking the IRQ code is
	simple and seems like a worthwhile intermediate step.
(End of quote)

I think we need to change it back, or change it to call the NMI handler
immediately after VMEXIT, before leaving the "noinstr" section, if needed.

Thanks,
Lai
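To make the hazard concrete, here is the sequence being described above as a
commented timeline; it is a sketch for illustration (the #DB/#BP source in
step 2 is only an example), not code from the patch:

/*
 * Sketch of the hazard described above (a timeline, not code from the
 * patch):
 *
 *  1. An NMI arrives in guest mode and causes an NMI VM-exit; after the
 *     VM-exit completes, the CPU leaves NMIs blocked, just as after a
 *     host NMI delivery.
 *  2. Before KVM reaches handle_exception_nmi_irqoff(), a #DB or #BP
 *     (e.g. a breakpoint placed on this code path) is delivered and its
 *     handler returns with IRET.
 *  3. That IRET also clears the hidden NMI-blocked state, so a second
 *     NMI can now arrive at any instruction.
 *  4. vmx_do_interrupt_nmi_irqoff() then CALLs the NMI IDT entry on the
 *     current kernel stack, while that entry assumes it starts out
 *     NMI-masked on its IST stack.
 */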
On 26/04/21 11:33, Lai Jiangshan wrote:
> When handle_interrupt_nmi_irqoff() is called, we may lose the
> CPU-hidden-NMI-masked state due to the IRET of #DB, #BP or other traps
> between VMEXIT and handle_interrupt_nmi_irqoff().
>
> But the NMI handler in the Linux kernel *expects* the CPU-hidden-NMI-masked
> state to still be set in the CPU so that no nested NMI intrudes into the
> beginning of the handler.
>
> The original code "int $2" can provide the needed CPU-hidden-NMI-masked
> state when entering #NMI, but I doubt this change can.

How would "int $2" block NMIs?  The hidden effect of this change (and I
should have reviewed better the effect on the NMI entry code) is that
the call will not use the IST anymore.

However, I'm not sure which of the two situations is better: entering
the NMI handler on the IST without setting the hidden NMI-blocked flag
could be a recipe for bad things as well.

Paolo
On Mon, 2021-04-26 at 12:40 +0200, Paolo Bonzini wrote:
> On 26/04/21 11:33, Lai Jiangshan wrote:
> > When handle_interrupt_nmi_irqoff() is called, we may lose the
> > CPU-hidden-NMI-masked state due to the IRET of #DB, #BP or other traps
> > between VMEXIT and handle_interrupt_nmi_irqoff().
> >
> > But the NMI handler in the Linux kernel *expects* the CPU-hidden-NMI-masked
> > state to still be set in the CPU so that no nested NMI intrudes into the
> > beginning of the handler.
> >
> > The original code "int $2" can provide the needed CPU-hidden-NMI-masked
> > state when entering #NMI, but I doubt this change can.
>
> How would "int $2" block NMIs?  The hidden effect of this change (and I
> should have reviewed better the effect on the NMI entry code) is that
> the call will not use the IST anymore.
>
> However, I'm not sure which of the two situations is better: entering
> the NMI handler on the IST without setting the hidden NMI-blocked flag
> could be a recipe for bad things as well.

If I understand this correctly, we can't really set the NMI-blocked flag
on Intel, but only keep it from being cleared by an IRET after it was
set by the intercepted NMI.

Thus the goal of this patchset was to make sure that we don't call any
interrupt handlers that can do IRET before we call the NMI handler.

Indeed I don't think that doing int $2 helps, unless I miss something.
We just need to make sure that we call the NMI handler as soon as
possible.

If only Intel had the GI flag....

My 0.2 cents.

Best regards,
	Maxim Levitsky
On Mon, 26 Apr 2021 14:44:49 +0300
Maxim Levitsky <mlevitsk@redhat.com> wrote:

> On Mon, 2021-04-26 at 12:40 +0200, Paolo Bonzini wrote:
> > On 26/04/21 11:33, Lai Jiangshan wrote:
> > > When handle_interrupt_nmi_irqoff() is called, we may lose the
> > > CPU-hidden-NMI-masked state due to the IRET of #DB, #BP or other traps
> > > between VMEXIT and handle_interrupt_nmi_irqoff().
> > >
> > > But the NMI handler in the Linux kernel *expects* the CPU-hidden-NMI-masked
> > > state to still be set in the CPU so that no nested NMI intrudes into the
> > > beginning of the handler.

This is incorrect. The Linux kernel has for some time handled the case of
nested NMIs. It had to, to implement the ftrace breakpoint updates, as
they would trigger an int3 in an NMI, which would "unmask" the NMIs. There
was also a long-standing bug where a page fault could do the same (the
reason you could never do a dump of all tasks from NMI without triple
faulting!).

But that's been fixed a long time ago, and I even wrote an LWN article
about it ;-)

  https://lwn.net/Articles/484932/

The NMI handler can handle the case of nested NMIs, and implements a
software "latch" to remember that another NMI is to be executed, if there
is a nested one. And it does so after the first one has finished.

-- Steve
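A minimal sketch of the latch described above, loosely modeled on the
per-CPU state machine in arch/x86/kernel/nmi.c; sketch_exc_nmi is an
illustrative name, not the kernel's real entry point, and the real
handler does considerably more around the latch:

enum nmi_states {
	NMI_NOT_RUNNING = 0,
	NMI_EXECUTING,
	NMI_LATCHED,
};
static DEFINE_PER_CPU(enum nmi_states, nmi_state);

static void sketch_exc_nmi(struct pt_regs *regs)
{
	/*
	 * A nested NMI (possible once an IRET has unmasked NMIs) sees
	 * NMI_EXECUTING, records itself as NMI_LATCHED and returns; the
	 * outer handler replays it below.
	 */
	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
		this_cpu_write(nmi_state, NMI_LATCHED);
		return;
	}
	this_cpu_write(nmi_state, NMI_EXECUTING);

nmi_restart:
	default_do_nmi(regs);		/* the actual NMI work */

	/*
	 * NMI_LATCHED (2) decrements to NMI_EXECUTING (1): another NMI
	 * arrived meanwhile, so run the handler again.  NMI_EXECUTING
	 * decrements to 0 and we are done.
	 */
	if (this_cpu_dec_return(nmi_state))
		goto nmi_restart;
}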
> > The original code "int $2" can provide the needed CPU-hidden-NMI-masked
> > state when entering #NMI, but I doubt this change can.
>
> How would "int $2" block NMIs?  The hidden effect of this change (and I
> should have reviewed better the effect on the NMI entry code) is that the
> call will not use the IST anymore.

My understanding is that int $2 does not block NMIs.

So reentries might have been possible.

-Andi
> On Apr 26, 2021, at 7:51 AM, Andi Kleen <ak@linux.intel.com> wrote:
>
>>> The original code "int $2" can provide the needed CPU-hidden-NMI-masked
>>> state when entering #NMI, but I doubt this change can.
>>
>> How would "int $2" block NMIs?  The hidden effect of this change (and I
>> should have reviewed better the effect on the NMI entry code) is that the
>> call will not use the IST anymore.
>
> My understanding is that int $2 does not block NMIs.
>
> So reentries might have been possible.

The C NMI code has its own reentrancy protection and has for years. It
should work fine for this use case.

> -Andi
(Correct Sean Christopherson's email address.)

On Mon, Apr 26, 2021 at 6:40 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/04/21 11:33, Lai Jiangshan wrote:
> > When handle_interrupt_nmi_irqoff() is called, we may lose the
> > CPU-hidden-NMI-masked state due to the IRET of #DB, #BP or other traps
> > between VMEXIT and handle_interrupt_nmi_irqoff().
> >
> > But the NMI handler in the Linux kernel *expects* the CPU-hidden-NMI-masked
> > state to still be set in the CPU so that no nested NMI intrudes into the
> > beginning of the handler.
> >
> > The original code "int $2" can provide the needed CPU-hidden-NMI-masked
> > state when entering #NMI, but I doubt this change can.
>
> How would "int $2" block NMIs?

Sorry, I hadn't checked that.

> The hidden effect of this change (and I
> should have reviewed better the effect on the NMI entry code) is that
> the call will not use the IST anymore.
>
> However, I'm not sure which of the two situations is better: entering
> the NMI handler on the IST without setting the hidden NMI-blocked flag
> could be a recipe for bad things as well.

The change makes the ASM NMI entry be called on the kernel stack, but
the ASM NMI entry expects to run on the IST stack, where it plays with
the "NMI executing" variable. With this change, the stranded ASM NMI
entry will use a wrong/garbage "NMI executing" variable on the kernel
stack and may do something very wrong.

On Mon, Apr 26, 2021 at 9:59 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > > But the NMI handler in the Linux kernel *expects* the CPU-hidden-NMI-masked
> > > > state to still be set in the CPU so that no nested NMI intrudes into the
> > > > beginning of the handler.
>
> This is incorrect. The Linux kernel has for some time handled the case of
> nested NMIs. It had to, to implement the ftrace breakpoint updates, as
> they would trigger an int3 in an NMI, which would "unmask" the NMIs. There
> was also a long-standing bug where a page fault could do the same (the
> reason you could never do a dump of all tasks from NMI without triple
> faulting!).
>
> But that's been fixed a long time ago, and I even wrote an LWN article
> about it ;-)
>
>   https://lwn.net/Articles/484932/
>
> The NMI handler can handle the case of nested NMIs, and implements a
> software "latch" to remember that another NMI is to be executed, if there
> is a nested one. And it does so after the first one has finished.

Sorry, in my reply, "the NMI handler" meant the ASM entry installed in
the IDT, which really does expect to be NMI-masked at the beginning.

The C NMI handler can handle the case of nested NMIs, which is useful
here. I think we should change the code to call the C NMI handler
directly here, as Andy Lutomirski suggested:

On Mon, Apr 26, 2021 at 11:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
> The C NMI code has its own reentrancy protection and has for years.
> It should work fine for this use case.

I think this is the right way.
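For reference, the follow-up patch linked later in the thread took roughly
the following direction. This is a minimal sketch, not the committed patch;
it assumes the kernel exports a non-IST NMI entry point (the follow-up
series names it asm_exc_nmi_noist) that is safe to invoke on the regular
kernel stack, and handle_nmi_irqoff is an illustrative name, not KVM's:

/* Assumed export: an NMI entry that does not rely on the IST stack. */
extern asmlinkage void asm_exc_nmi_noist(void);

static void handle_nmi_irqoff(struct kvm_vcpu *vcpu)
{
	kvm_before_interrupt(vcpu);
	/* CALL the non-IST NMI entry directly instead of "int $2". */
	vmx_do_interrupt_nmi_irqoff((unsigned long)asm_exc_nmi_noist);
	kvm_after_interrupt(vcpu);
}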
On Tue, 27 Apr 2021 08:54:37 +0800
Lai Jiangshan <jiangshanlai+lkml@gmail.com> wrote:

> > However, I'm not sure which of the two situations is better: entering
> > the NMI handler on the IST without setting the hidden NMI-blocked flag
> > could be a recipe for bad things as well.
>
> The change makes the ASM NMI entry be called on the kernel stack, but
> the ASM NMI entry expects to run on the IST stack, where it plays with
> the "NMI executing" variable. With this change, the stranded ASM NMI
> entry will use a wrong/garbage "NMI executing" variable on the kernel
> stack and may do something very wrong.

I missed this detail.

> Sorry, in my reply, "the NMI handler" meant the ASM entry installed in
> the IDT, which really does expect to be NMI-masked at the beginning.
>
> The C NMI handler can handle the case of nested NMIs, which is useful
> here. I think we should change the code to call the C NMI handler
> directly here, as Andy Lutomirski suggested:

Yes, because that's the way x86_32 works.

> On Mon, Apr 26, 2021 at 11:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > The C NMI code has its own reentrancy protection and has for years.
> > It should work fine for this use case.
>
> I think this is the right way.

Agreed.

-- Steve
On 27/04/21 02:54, Lai Jiangshan wrote:
> The C NMI handler can handle the case of nested NMIs, which is useful
> here. I think we should change the code to call the C NMI handler
> directly here, as Andy Lutomirski suggested:

Great, can you send a patch?

Paolo

> On Mon, Apr 26, 2021 at 11:09 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > The C NMI code has its own reentrancy protection and has for years.
> > It should work fine for this use case.
>
> I think this is the right way.
On Tue, Apr 27, 2021 at 3:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 27/04/21 02:54, Lai Jiangshan wrote:
> > The C NMI handler can handle the case of nested NMIs, which is useful
> > here. I think we should change the code to call the C NMI handler
> > directly here, as Andy Lutomirski suggested:
>
> Great, can you send a patch?

Hello,

I sent it several days ago; could you review it, please? I will then
update the patchset with the feedback applied.

And thanks to Steven for the reviews.

https://lore.kernel.org/lkml/20210426230949.3561-4-jiangshanlai@gmail.com/

Thanks,
Lai
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 391f079d9136..b0eca151931d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6411,40 +6411,40 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)

 void vmx_do_interrupt_nmi_irqoff(unsigned long entry);

+static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
+{
+	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+	gate_desc *desc = (gate_desc *)host_idt_base + vector;
+
+	kvm_before_interrupt(vcpu);
+	vmx_do_interrupt_nmi_irqoff(gate_offset(desc));
+	kvm_after_interrupt(vcpu);
+}
+
 static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
 {
 	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);

 	/* if exit due to PF check for async PF */
-	if (is_page_fault(intr_info)) {
+	if (is_page_fault(intr_info))
 		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
 	/* Handle machine checks before interrupts are enabled */
-	} else if (is_machine_check(intr_info)) {
+	else if (is_machine_check(intr_info))
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
-	} else if (is_nmi(intr_info)) {
-		kvm_before_interrupt(&vmx->vcpu);
-		asm("int $2");
-		kvm_after_interrupt(&vmx->vcpu);
-	}
+	else if (is_nmi(intr_info))
+		handle_interrupt_nmi_irqoff(&vmx->vcpu, intr_info);
 }

 static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 {
-	unsigned int vector;
-	gate_desc *desc;
 	u32 intr_info = vmx_get_intr_info(vcpu);

 	if (WARN_ONCE(!is_external_intr(intr_info),
 	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;

-	vector = intr_info & INTR_INFO_VECTOR_MASK;
-	desc = (gate_desc *)host_idt_base + vector;
-
-	kvm_before_interrupt(vcpu);
-	vmx_do_interrupt_nmi_irqoff(gate_offset(desc));
-	kvm_after_interrupt(vcpu);
+	handle_interrupt_nmi_irqoff(vcpu, intr_info);
 }

 static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
Rework NMI VM-Exit handling to invoke the kernel handler by function
call instead of INTn.  INTn microcode is relatively expensive, and
aligning the IRQ and NMI handling will make it easier to update KVM
should some newfangled method for invoking the handlers come along.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)