Message ID | 1480445039-3434-1-git-send-email-prarit@redhat.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Headers | show |
On Tue, Nov 29, 2016 at 01:43:59PM -0500, Prarit Bhargava wrote: > When removing and adding cpu 0 on a system with GHES NMI the following stack > trace is seen when re-adding the cpu: > > WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+ > Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc5+ #59 > Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.01.00.0 > ffffffff81c03e78 ffffffff81337905 0000000000000000 0000000000000000 > ffffffff81c03eb8 ffffffff8107d9c1 00000545810aac4a 0000000000000000 > 00000000000000f0 0000000000000000 000081cb6440f1d0 0000000000000001 > Call Trace: > [<ffffffff81337905>] dump_stack+0x63/0x8e > [<ffffffff8107d9c1>] __warn+0xd1/0xf0 > [<ffffffff8107daad>] warn_slowpath_null+0x1d/0x20 > [<ffffffff810522b5>] setup_local_APIC+0x275/0x370 > [<ffffffff810523be>] apic_ap_setup+0xe/0x20 > [<ffffffff8104f5a8>] start_secondary+0x48/0x180 > [<ffffffff81d89aa0>] ? set_init_arg+0x55/0x55 > [<ffffffff81d89120>] ? early_idt_handler_array+0x120/0x120 > [<ffffffff81d895d6>] ? x86_64_start_reservations+0x2a/0x2c > [<ffffffff81d89715>] ? x86_64_start_kernel+0x13d/0x14c > ---[ end trace 7b6555b6343ef9ee ]--- Please remove all hex numbers from the splat - they're useless in the commit message. > During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an > NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the > ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR > (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also > 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms > that something has set the IRQ_WORK_VECTOR line prior to the APIC being > initialized. > > Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") > incorrectly modified the behavior such that the handler returns > NMI_HANDLED only if an error was processed, and incorrectly runs the ghes > work queue for every NMI. > > This patch modifies the ghes_proc_irq_work() to run as it did prior to > 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by > properly returning NMI_HANDLED and only calling the work queue if > NMI_HANDLED has been set. > > Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") > Signed-off-by: Prarit Bhargava <prarit@redhat.com> > Cc: Borislav Petkov <bp@suse.de> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Cc: Len Brown <lenb@kernel.org> > Cc: Paul Gortmaker <paul.gortmaker@windriver.com> > Cc: Tyler Baicar <tbaicar@codeaurora.org> > Cc: Punit Agrawal <punit.agrawal@arm.com> > Cc: Don Zickus <dzickus@redhat.com> > Cc: linux-acpi@vger.kernel.org > --- > drivers/acpi/apei/ghes.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index 0d099a24f776..39c45efbcb3d 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -858,17 +858,18 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) > if (sev >= GHES_SEV_PANIC) > __ghes_panic(ghes); > > + ret = NMI_HANDLED; > + Make that more explicit: if (ghes_read_estatus(ghes, 1)) { ghes_clear_estatus(ghes); continue; } else { ret = NMI_HANDLED; } > if (!(ghes->flags & GHES_TO_CLEAR)) > continue; > > __process_error(ghes); > ghes_clear_estatus(ghes); > - > - ret = NMI_HANDLED; > } > > #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG > - irq_work_queue(&ghes_proc_irq_work); > + if (ret == NMI_HANDLED) > + irq_work_queue(&ghes_proc_irq_work); > #endif > atomic_dec(&ghes_in_nmi); > return ret; > -- Otherwise looks ok, thanks.
On Thursday, December 01, 2016 09:07:39 PM Borislav Petkov wrote: > On Wed, Nov 30, 2016 at 08:19:39AM -0500, Prarit Bhargava wrote: > > When removing and adding cpu 0 on a system with GHES NMI the following stack > > trace is seen when re-adding the cpu: > > > > WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+ > > Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra > > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc6+ #2 > > Call Trace: > > dump_stack+0x63/0x8e > > __warn+0xd1/0xf0 > > warn_slowpath_null+0x1d/0x20 > > setup_local_APIC+0x275/0x370 > > apic_ap_setup+0xe/0x20 > > start_secondary+0x48/0x180 > > set_init_arg+0x55/0x55 > > early_idt_handler_array+0x120/0x120 > > x86_64_start_reservations+0x2a/0x2c > > x86_64_start_kernel+0x13d/0x14c > > > > During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an > > NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the > > ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR > > (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also > > 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms > > that something has set the IRQ_WORK_VECTOR line prior to the APIC being > > initialized. > > > > Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") > > incorrectly modified the behavior such that the handler returns > > NMI_HANDLED only if an error was processed, and incorrectly runs the ghes > > work queue for every NMI. > > > > This patch modifies the ghes_proc_irq_work() to run as it did prior to > > 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by > > properly returning NMI_HANDLED and only calling the work queue if > > NMI_HANDLED has been set. > > > > v2: Borislav, setting of NMI_HANDLED moved & cleaned up changelog. > > > > Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") > > Signed-off-by: Prarit Bhargava <prarit@redhat.com> > > Cc: Borislav Petkov <bp@suse.de> > > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Cc: Len Brown <lenb@kernel.org> > > Cc: Paul Gortmaker <paul.gortmaker@windriver.com> > > Cc: Tyler Baicar <tbaicar@codeaurora.org> > > Cc: Punit Agrawal <punit.agrawal@arm.com> > > Cc: Don Zickus <dzickus@redhat.com> > > --- > > drivers/acpi/apei/ghes.c | 7 ++++--- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > Reviewed-by: Borislav Petkov <bp@suse.de> I guess I should pick up this one, then? Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday, December 01, 2016 10:51:49 PM Borislav Petkov wrote: > On Thu, Dec 01, 2016 at 10:17:54PM +0100, Rafael J. Wysocki wrote: > > I guess I should pick up this one, then? > > If you already have stuff touching this area, it probably would be more > prudent if you took it. If not, I can take it through tip's ras branch, > if you prefer. Well, there's another ARM-related patch touching APEI. I guess whoever takes this one should also take the other one and honestly they can go in via any tree as far as I'm concerned, I'm just trying to avoid merge clashes here. :-) Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote: > Well, there's another ARM-related patch touching APEI. > > I guess whoever takes this one should also take the other one and > honestly they can go in via any tree as far as I'm concerned, I'm just trying to > avoid merge clashes here. :-) Maybe have ARM-folk ACK the other one and take both through your ACPI tree? They both do have ACPI in common :-)
On Thursday, December 01, 2016 11:47:17 PM Borislav Petkov wrote: > On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote: > > Well, there's another ARM-related patch touching APEI. > > > > I guess whoever takes this one should also take the other one and > > honestly they can go in via any tree as far as I'm concerned, I'm just trying to > > avoid merge clashes here. :-) > > Maybe have ARM-folk ACK the other one and take both through your ACPI > tree? They both do have ACPI in common :-) That one have been ACKed already. OK, I'll take them both. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016/12/2 7:12, Rafael J. Wysocki wrote: > On Thursday, December 01, 2016 11:47:17 PM Borislav Petkov wrote: >> On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote: >>> Well, there's another ARM-related patch touching APEI. >>> >>> I guess whoever takes this one should also take the other one and >>> honestly they can go in via any tree as far as I'm concerned, I'm just trying to >>> avoid merge clashes here. :-) >> Maybe have ARM-folk ACK the other one and take both through your ACPI >> tree? They both do have ACPI in common :-) > That one have been ACKed already. > > OK, I'll take them both. Thank you very much :) Hanjun -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Rafael, On 2 December 2016 at 07:12, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > On Thursday, December 01, 2016 11:47:17 PM Borislav Petkov wrote: >> On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote: >> > Well, there's another ARM-related patch touching APEI. >> > >> > I guess whoever takes this one should also take the other one and >> > honestly they can go in via any tree as far as I'm concerned, I'm just trying to >> > avoid merge clashes here. :-) >> >> Maybe have ARM-folk ACK the other one and take both through your ACPI >> tree? They both do have ACPI in common :-) > > That one have been ACKed already. I guess that is "[PATCH v15] acpi, apei, arm64: APEI initial support for aarch64." > > OK, I'll take them both. Great thanks, Rafael :-) > > Thanks, > Rafael > > -- > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 0d099a24f776..39c45efbcb3d 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -858,17 +858,18 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) if (sev >= GHES_SEV_PANIC) __ghes_panic(ghes); + ret = NMI_HANDLED; + if (!(ghes->flags & GHES_TO_CLEAR)) continue; __process_error(ghes); ghes_clear_estatus(ghes); - - ret = NMI_HANDLED; } #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG - irq_work_queue(&ghes_proc_irq_work); + if (ret == NMI_HANDLED) + irq_work_queue(&ghes_proc_irq_work); #endif atomic_dec(&ghes_in_nmi); return ret;
When removing and adding cpu 0 on a system with GHES NMI the following stack trace is seen when re-adding the cpu: WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+ Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc5+ #59 Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.01.00.0 ffffffff81c03e78 ffffffff81337905 0000000000000000 0000000000000000 ffffffff81c03eb8 ffffffff8107d9c1 00000545810aac4a 0000000000000000 00000000000000f0 0000000000000000 000081cb6440f1d0 0000000000000001 Call Trace: [<ffffffff81337905>] dump_stack+0x63/0x8e [<ffffffff8107d9c1>] __warn+0xd1/0xf0 [<ffffffff8107daad>] warn_slowpath_null+0x1d/0x20 [<ffffffff810522b5>] setup_local_APIC+0x275/0x370 [<ffffffff810523be>] apic_ap_setup+0xe/0x20 [<ffffffff8104f5a8>] start_secondary+0x48/0x180 [<ffffffff81d89aa0>] ? set_init_arg+0x55/0x55 [<ffffffff81d89120>] ? early_idt_handler_array+0x120/0x120 [<ffffffff81d895d6>] ? x86_64_start_reservations+0x2a/0x2c [<ffffffff81d89715>] ? x86_64_start_kernel+0x13d/0x14c ---[ end trace 7b6555b6343ef9ee ]--- During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms that something has set the IRQ_WORK_VECTOR line prior to the APIC being initialized. Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") incorrectly modified the behavior such that the handler returns NMI_HANDLED only if an error was processed, and incorrectly runs the ghes work queue for every NMI. This patch modifies the ghes_proc_irq_work() to run as it did prior to 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by properly returning NMI_HANDLED and only calling the work queue if NMI_HANDLED has been set. Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tyler Baicar <tbaicar@codeaurora.org> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: linux-acpi@vger.kernel.org --- drivers/acpi/apei/ghes.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)