kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included

On Thursday, August 18, 2011, Rafael J. Wysocki wrote:
> Hi,
> 
> It's better to report ACPI failues to linux-acpi.
> 
> On Wednesday, August 17, 2011, Rick Warner wrote:
> > Hi all,
> > 
> > I am getting a kernel oops/panic on a dual xeon system that is the master of a 
> > 60 node HPC cluster.  This is happening while stress testing the system 
> > including significant network traffic.  The OS is openSuse 11.4. 
> > 
> > We are running a custom compiled 2.6.39.3 kernel on the systems due to a bug 
> > in the stock kernel 11.4 provided (igb driver related).  After 1-3 days of 
> > heavy testing, the master node locks up with the caps lock and scroll lock 
> > keys on the keyboard blinking with the following output captured via a serial 
> > console:
> > 
> > [381920.681113] BUG: unable to handle kernel NULL pointer dereference at           
> > (null)
> > [381920.689067] IP: [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> > [381920.695187] PGD 30c27a067 PUD 16efe6067 PMD 0 
> > [381920.699782] Oops: 0000 [#1] PREEMPT SMP 
> > [381920.703866] last sysfs file: 
> > /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
> > [381920.711868] CPU 6 
> > [381920.713800] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler 
> > nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle 
> > iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter 
> > ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace 
> > cpufreq_powersave acpi_cpufreq mperf xfs dm_mod ioatdma i7core_edac edac_core 
> > sr_mod cdrom joydev igb i2c_i801 sg button ghes hed iTCO_wdt 
> > iTCO_vendor_support dca pcspkr ext4 jbd2 crc16 raid456 async_raid6_recov 
> > async_pq raid6_pq async_xor xor async_memcpy async_tx raid10 raid1 raid0 fan 
> > processor thermal thermal_sys ata_generic pata_atiixp arcmsr
> > [381920.771623] 
> > [381920.773210] Pid: 12701, comm: cluster Not tainted 2.6.39.3-microwaycustom 
> > #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
> > [381920.783292] RIP: 0010:[<ffffffff812a7510>]  [<ffffffff812a7510>] 
> > acpi_atomic_read+0xe3/0x120
> > [381920.791853] RSP: 0000:ffff88063fc47d98  EFLAGS: 00010046
> > [381920.797260] RAX: 0000000000000000 RBX: 00000000bf7b5ff0 RCX: ffffffff81a3cdd0
> > [381920.804486] RDX: 00000000bf7b6010 RSI: 00000000bf7b6000 RDI: 
> > ffff88062d4b95c0
> > [381920.811712] RBP: ffff88063fc47dc8 R08: ffff88063fc47d98 R09: 0000000000000002
> > [381920.818940] R10: 0000000000000083 R11: 0000000000000010 R12: 
> > ffffc90003044c20
> > [381920.826168] R13: ffff88063fc47de0 R14: 0000000000000000 R15: 
> > 0000000000000000
> > [381920.833392] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) 
> > knlGS:0000000000000000
> > [381920.841571] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > [381920.847409] CR2: 0000000000000000 CR3: 00000001d2d0f000 CR4: 
> > 00000000000006e0
> > [381920.854635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> > 0000000000000000
> > [381920.861861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [381920.869088] Process cluster (pid: 12701, threadinfo ffff880028284000, task 
> > ffff880550184380)
> > [381920.877608] Stack:
> > [381920.879717]  ffffffff81a3cdd0 00000000bf7b5ff0 ffff88062b2cc140 ffffc90003044ca8
> > [381920.887288]  ffff88062b2cc140 0000000000000001 ffff88063fc47e08 ffffffffa002b21f
> > [381920.894862]  0000000000000000 0000000000000000 ffff88062b2cc140 
> > 0000000000000000
> > [381920.902435] Call Trace:
> > [381920.904969]  <NMI> 
> > [381920.907201]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
> > [381920.913726]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
> > [381920.920076]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> > [381920.926080]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
> > [381920.932864]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
> > [381920.939473]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> > [381920.944691]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> > [381920.949649]  [<ffffffff81517550>] nmi+0x20/0x30
> > [381920.954269]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> > [381920.960536]  <<EOE>> 
> > [381920.962725]  <IRQ> 
> > [381920.965132]  [<ffffffff8101131e>] intel_pmu_disable_all+0x3e/0x120
> > [381920.971399]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> > [381920.977066]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> > [381920.982817]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> > [381920.989082]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> > [381920.994740]  [<ffffffff81067097>] update_process_times+0x67/0x80
> > [381921.000832]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> > [381921.006578]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> > [381921.012758]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> > [381921.018414]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> > [381921.024343]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
> > [381921.030793]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> > [381921.036891]  <EOI> 
> > [381921.038913] Code: fc 10 74 1f 77 08 41 80 fc 08 75 48 eb 0e 41 80 fc 20 74 
> > 17 41 80 fc 40 75 3a eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 
> > 00 89 c0 eb 03 48 8b 00 49 89 45 00 e8 8e 2b e2 ff eb 1b 0f 
> > [381921.059462] RIP  [<ffffffff812a7510>] acpi_atomic_read+0xe3/0x120
> > [381921.065669]  RSP <ffff88063fc47d98>
> > [381921.069242] CR2: 0000000000000000
> > [381921.072645] ---[ end trace 52697bfc73a34a90 ]---
> > [381921.077343] Kernel panic - not syncing: Fatal exception in interrupt
> > [381921.083784] Pid: 12701, comm: cluster Tainted: G      D     2.6.39.3-
> > microwaycustom #1
> > [381921.091788] Call Trace:
> > [381921.094333]  <NMI>  [<ffffffff81513300>] panic+0x9f/0x1da
> > [381921.099881]  [<ffffffff81517f8c>] oops_end+0xdc/0xf0
> > [381921.104952]  [<ffffffff81032b91>] no_context+0xf1/0x260
> > [381921.110277]  [<ffffffff81032e55>] __bad_area_nosemaphore+0x155/0x200
> > [381921.116739]  [<ffffffff81032f0e>] bad_area_nosemaphore+0xe/0x10
> > [381921.122766]  [<ffffffff81519f46>] do_page_fault+0x366/0x530
> > [381921.128440]  [<ffffffff810ee929>] ? __perf_event_overflow+0xa9/0x220
> > [381921.134896]  [<ffffffff810ef99b>] ? perf_event_update_userpage+0x9b/0xe0
> > [381921.141698]  [<ffffffff81012249>] ? intel_pmu_enable_all+0xc9/0x1a0
> > [381921.148061]  [<ffffffff810123ff>] ? x86_perf_event_set_period+0xdf/0x170
> > [381921.154852]  [<ffffffff81517295>] page_fault+0x25/0x30
> > [381921.160081]  [<ffffffff812a7510>] ? acpi_atomic_read+0xe3/0x120
> > [381921.166094]  [<ffffffff812a7486>] ? acpi_atomic_read+0x59/0x120
> > [381921.172107]  [<ffffffffa002b21f>] ghes_read_estatus+0x2f/0x170 [ghes]
> > [381921.178639]  [<ffffffffa002b618>] ghes_notify_nmi+0xd8/0x1b0 [ghes]
> > [381921.185004]  [<ffffffff8151a14f>] notifier_call_chain+0x3f/0x80
> > [381921.191003]  [<ffffffff8151a1d8>] __atomic_notifier_call_chain+0x48/0x70
> > [381921.197789]  [<ffffffff8151a211>] atomic_notifier_call_chain+0x11/0x20
> > [381921.204399]  [<ffffffff8151a24e>] notify_die+0x2e/0x30
> > [381921.209624]  [<ffffffff81517bd2>] do_nmi+0xa2/0x270
> > [381921.214584]  [<ffffffff81517550>] nmi+0x20/0x30
> > [381921.219208]  [<ffffffff8102ae8a>] ? native_write_msr_safe+0xa/0x10
> > [381921.225470]  <<EOE>>  <IRQ>  [<ffffffff8101131e>] 
> > intel_pmu_disable_all+0x3e/0x120
> > [381921.233174]  [<ffffffff81010d5a>] x86_pmu_disable+0x4a/0x50
> > [381921.238836]  [<ffffffff810e7f2b>] perf_pmu_disable+0x2b/0x40
> > [381921.244582]  [<ffffffff810ee828>] perf_event_task_tick+0x218/0x270
> > [381921.250855]  [<ffffffff81046f4d>] scheduler_tick+0xdd/0x2c0
> > [381921.256520]  [<ffffffff81067097>] update_process_times+0x67/0x80
> > [381921.262618]  [<ffffffff81089eef>] tick_sched_timer+0x5f/0xc0
> > [381921.268363]  [<ffffffff81089e90>] ? tick_nohz_handler+0x100/0x100
> > [381921.274540]  [<ffffffff8107db3d>] __run_hrtimer+0x12d/0x280
> > [381921.280198]  [<ffffffff8107dea7>] hrtimer_interrupt+0xb7/0x1e0
> > [381921.286125]  [<ffffffff810206d7>] smp_apic_timer_interrupt+0x67/0xa0
> > [381921.292578]  [<ffffffff8151e3f3>] apic_timer_interrupt+0x13/0x20
> > [381921.298673]  <EOI>
> > 
> > 
> > I recompiled the kernel again, disabling the tickless feature as I saw "nohz" 
> > early in the call trace.  After that, it reproduced again but the call trace 
> > only changed slightly having tick_init_highres in there instead of 
> > tick_nohz_handler.  I have that call trace if it is desired as well.  It is 
> > nearly identical though.
> > 
> > I am currently trying a custom 2.6.36.4 on the system, but will need up to 3 
> > days before I know if the problem exists there as well.
> > 
> > Any ideas on this?
> 
> It looks like a wrong address in ghes_read_estatus().  I'll have a look at it
> later today.

Hmm.  Please apply the appended patch and see if the error is still
reproducible (it's on top of the current mainline, so you may need to
adjust it by hand).

Thanks,
Rafael

---
 drivers/acpi/apei/ghes.c |    3 +++
 1 file changed, 3 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included

Commit Message

Comments

Patch