Message ID | 20240124084014.1772906-1-linmiaohe@huawei.com (mailing list archive) |
---|---|
State | New |
Series | [v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page |
On Wed, Jan 24, 2024 at 04:40:14PM +0800, Miaohe Lin wrote:
> When I did soft offline stress test, a machine was observed to crash with
> the following message:
>
> kernel BUG at include/linux/memcontrol.h:554!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> RIP: 0010:folio_memcg+0xaf/0xd0
> Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
> RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
> RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
> RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
> RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
> R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
> R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
> FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
> Call Trace:
> <TASK>
> ? die+0x32/0x90
> ? do_trap+0xde/0x110
> ? folio_memcg+0xaf/0xd0
> ? do_error_trap+0x60/0x80
> ? folio_memcg+0xaf/0xd0
> ? exc_invalid_op+0x53/0x70
> ? folio_memcg+0xaf/0xd0
> ? asm_exc_invalid_op+0x1a/0x20
> ? folio_memcg+0xaf/0xd0
> ? folio_memcg+0xae/0xd0

I might trim these ? lines out of the backtrace ...

> split_huge_page_to_list+0x4d/0x1380
> ? sysvec_apic_timer_interrupt+0xf/0x80
> try_to_split_thp_page+0x3a/0xf0
> soft_offline_page+0x1ea/0x8a0
> soft_offline_page_store+0x52/0x90
> kernfs_fop_write_iter+0x118/0x1b0
> vfs_write+0x30b/0x430
> ksys_write+0x5e/0xe0
> do_syscall_64+0xb0/0x1b0
> entry_SYSCALL_64_after_hwframe+0x6d/0x75
> RIP: 0033:0x7f6c60d14697
> Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
> RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
> RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
>
> The problem is that page->mapping is overloaded with slab->slab_list or
> slabs fields now, so slab pages could be taken as non-LRU movable pages
> if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
> to LIST_POISON2. These slab pages will be treated as thp later leading
> to crash in split_huge_page_to_list().
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

On 2024/1/24 21:15, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 04:40:14PM +0800, Miaohe Lin wrote:
>> When I did soft offline stress test, a machine was observed to crash with
>> the following message:
>>
>> kernel BUG at include/linux/memcontrol.h:554!
>> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> RIP: 0010:folio_memcg+0xaf/0xd0
>> Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
>> RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
>> RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
>> RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
>> RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
>> R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
>> R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
>> FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
>> Call Trace:
>> <TASK>
>> ? die+0x32/0x90
>> ? do_trap+0xde/0x110
>> ? folio_memcg+0xaf/0xd0
>> ? do_error_trap+0x60/0x80
>> ? folio_memcg+0xaf/0xd0
>> ? exc_invalid_op+0x53/0x70
>> ? folio_memcg+0xaf/0xd0
>> ? asm_exc_invalid_op+0x1a/0x20
>> ? folio_memcg+0xaf/0xd0
>> ? folio_memcg+0xae/0xd0
>
> I might trim these ? lines out of the backtrace ...

Do you mean make backtrace looks like something below?

Call Trace:
<TASK>
split_huge_page_to_list+0x4d/0x1380
? sysvec_apic_timer_interrupt+0xf/0x80
try_to_split_thp_page+0x3a/0xf0
soft_offline_page+0x1ea/0x8a0
soft_offline_page_store+0x52/0x90
kernfs_fop_write_iter+0x118/0x1b0
vfs_write+0x30b/0x430
ksys_write+0x5e/0xe0
do_syscall_64+0xb0/0x1b0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f6c60d14697

>
>> split_huge_page_to_list+0x4d/0x1380
>> ? sysvec_apic_timer_interrupt+0xf/0x80
>> try_to_split_thp_page+0x3a/0xf0
>> soft_offline_page+0x1ea/0x8a0
>> soft_offline_page_store+0x52/0x90
>> kernfs_fop_write_iter+0x118/0x1b0
>> vfs_write+0x30b/0x430
>> ksys_write+0x5e/0xe0
>> do_syscall_64+0xb0/0x1b0
>> entry_SYSCALL_64_after_hwframe+0x6d/0x75
>> RIP: 0033:0x7f6c60d14697
>> Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>> RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
>> RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
>> RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
>> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>> R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
>>
>> The problem is that page->mapping is overloaded with slab->slab_list or
>> slabs fields now, so slab pages could be taken as non-LRU movable pages
>> if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
>> to LIST_POISON2. These slab pages will be treated as thp later leading
>> to crash in split_huge_page_to_list().
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Many thanks for your review.

On Thu, Jan 25, 2024 at 07:53:25PM +0800, Miaohe Lin wrote:
> On 2024/1/24 21:15, Matthew Wilcox wrote:
> >> Call Trace:
> >> <TASK>
> >> ? die+0x32/0x90
> >> ? do_trap+0xde/0x110
> >> ? folio_memcg+0xaf/0xd0
> >> ? do_error_trap+0x60/0x80
> >> ? folio_memcg+0xaf/0xd0
> >> ? exc_invalid_op+0x53/0x70
> >> ? folio_memcg+0xaf/0xd0
> >> ? asm_exc_invalid_op+0x1a/0x20
> >> ? folio_memcg+0xaf/0xd0
> >> ? folio_memcg+0xae/0xd0
> >
> > I might trim these ? lines out of the backtrace ...
>
> Do you mean make backtrace looks like something below?
>
> Call Trace:
> <TASK>
> split_huge_page_to_list+0x4d/0x1380
> ? sysvec_apic_timer_interrupt+0xf/0x80
> try_to_split_thp_page+0x3a/0xf0
> soft_offline_page+0x1ea/0x8a0
> soft_offline_page_store+0x52/0x90
> kernfs_fop_write_iter+0x118/0x1b0
> vfs_write+0x30b/0x430
> ksys_write+0x5e/0xe0
> do_syscall_64+0xb0/0x1b0
> entry_SYSCALL_64_after_hwframe+0x6d/0x75
> RIP: 0033:0x7f6c60d14697

Yes. I'd trim the sysvec_apic_timer_interrupt+0xf/0x80 line too.
These lines aren't actually part of the call trace. They're addresses
that the unwinder found on the stack but don't actually fit the call
trace. It puts them in in case they're helpful, but marks them with a ?
to indicate that they're probably not part of the call trace.

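The trimming described above can be done mechanically. What follows is only a rough sketch, not an existing kernel tool: the helper names is_unreliable_frame() and filter_trace() are made up for illustration, and the code assumes that an unreliable frame is any line whose first non-blank, non-quote character is a '?' followed by a space.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if a backtrace line is one of the
 * '?'-marked entries, i.e. an address the unwinder found on the stack
 * but could not fit into the call chain. Leading whitespace and email
 * quote markers ('>') are skipped first.
 */
static bool is_unreliable_frame(const char *line)
{
	while (*line == '>' || isspace((unsigned char)*line))
		line++;
	return line[0] == '?' && line[1] == ' ';
}

/*
 * Hypothetical helper: copies a trace from 'in' to 'out', dropping the
 * '?'-marked lines. Lines longer than the buffer are handled by deciding
 * once per line, at its first chunk, whether to skip it.
 */
static void filter_trace(FILE *in, FILE *out)
{
	char buf[1024];
	bool at_line_start = true;
	bool skipping = false;

	while (fgets(buf, sizeof(buf), in)) {
		if (at_line_start)
			skipping = is_unreliable_frame(buf);
		if (!skipping)
			fputs(buf, out);
		/* A chunk without '\n' means the same line continues. */
		at_line_start = strchr(buf, '\n') != NULL;
	}
}
```

Calling filter_trace(stdin, stdout) from a small main() would turn this into a filter for a saved oops. The result still needs a human look before it goes into a commit message, since a '?' entry can occasionally be useful context.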
On 2024/1/25 22:22, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 07:53:25PM +0800, Miaohe Lin wrote:
>> On 2024/1/24 21:15, Matthew Wilcox wrote:
>>>> Call Trace:
>>>> <TASK>
>>>> ? die+0x32/0x90
>>>> ? do_trap+0xde/0x110
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? do_error_trap+0x60/0x80
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? exc_invalid_op+0x53/0x70
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? asm_exc_invalid_op+0x1a/0x20
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? folio_memcg+0xae/0xd0
>>>
>>> I might trim these ? lines out of the backtrace ...
>>
>> Do you mean make backtrace looks like something below?
>>
>> Call Trace:
>> <TASK>
>> split_huge_page_to_list+0x4d/0x1380
>> ? sysvec_apic_timer_interrupt+0xf/0x80
>> try_to_split_thp_page+0x3a/0xf0
>> soft_offline_page+0x1ea/0x8a0
>> soft_offline_page_store+0x52/0x90
>> kernfs_fop_write_iter+0x118/0x1b0
>> vfs_write+0x30b/0x430
>> ksys_write+0x5e/0xe0
>> do_syscall_64+0xb0/0x1b0
>> entry_SYSCALL_64_after_hwframe+0x6d/0x75
>> RIP: 0033:0x7f6c60d14697
>
> Yes. I'd trim the sysvec_apic_timer_interrupt+0xf/0x80 line too.
> These lines aren't actually part of the call trace. They're addresses
> that the unwinder found on the stack but don't actually fit the call
> trace. It puts them in in case they're helpful, but marks them with a ?
> to indicate that they're probably not part of the call trace.

I see. Many thanks for your explanation. Will update backtrace in next version.

Thanks.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 636280d04008..9349948f1abf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1377,6 +1377,9 @@ void ClearPageHWPoisonTakenOff(struct page *page)
  */
 static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 {
+	if (PageSlab(page))
+		return false;
+
 	/* Soft offline could migrate non-LRU movable pages */
 	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
 		return true;

When I did soft offline stress test, a machine was observed to crash with
the following message:

 kernel BUG at include/linux/memcontrol.h:554!
 invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
 RIP: 0010:folio_memcg+0xaf/0xd0
 Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
 RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
 RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
 RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
 R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
 R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
 FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
 Call Trace:
 <TASK>
 ? die+0x32/0x90
 ? do_trap+0xde/0x110
 ? folio_memcg+0xaf/0xd0
 ? do_error_trap+0x60/0x80
 ? folio_memcg+0xaf/0xd0
 ? exc_invalid_op+0x53/0x70
 ? folio_memcg+0xaf/0xd0
 ? asm_exc_invalid_op+0x1a/0x20
 ? folio_memcg+0xaf/0xd0
 ? folio_memcg+0xae/0xd0
 split_huge_page_to_list+0x4d/0x1380
 ? sysvec_apic_timer_interrupt+0xf/0x80
 try_to_split_thp_page+0x3a/0xf0
 soft_offline_page+0x1ea/0x8a0
 soft_offline_page_store+0x52/0x90
 kernfs_fop_write_iter+0x118/0x1b0
 vfs_write+0x30b/0x430
 ksys_write+0x5e/0xe0
 do_syscall_64+0xb0/0x1b0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
 RIP: 0033:0x7f6c60d14697
 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
 RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
 RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
 RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
 R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00

The problem is that page->mapping is overloaded with slab->slab_list or
slabs fields now, so slab pages could be taken as non-LRU movable pages
if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
to LIST_POISON2. These slab pages will be treated as thp later leading
to crash in split_huge_page_to_list().

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
---
v2:
  Check PageSlab() first to leave the rest code alone per Matthew.
---
 mm/memory-failure.c | 3 +++
 1 file changed, 3 insertions(+)
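
To make the LIST_POISON2 case in the explanation above concrete, here is a small userspace model of the check; it is a sketch under stated assumptions, not kernel code. The PAGE_MAPPING_* values and the 0x122 offset of LIST_POISON2 follow the kernel headers, while POISON_POINTER_DELTA assumes the x86_64 default CONFIG_ILLEGAL_POINTER_VALUE. struct model_page, MODEL_MF_SOFT_OFFLINE and the model_* functions are hypothetical stand-ins, and the model leaves out the LRU and free-buddy checks that the real HWPoisonHandlable() also performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of page->mapping, as in include/linux/page-flags.h. */
#define PAGE_MAPPING_ANON	UINT64_C(0x1)
#define PAGE_MAPPING_MOVABLE	UINT64_C(0x2)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Assumption: x86_64 default CONFIG_ILLEGAL_POINTER_VALUE. */
#define POISON_POINTER_DELTA	UINT64_C(0xdead000000000000)
#define LIST_POISON2		(UINT64_C(0x122) + POISON_POINTER_DELTA)

/* Hypothetical flag value, used only by this model. */
#define MODEL_MF_SOFT_OFFLINE	0x1UL

/* Hypothetical stand-in for the parts of struct page that matter here. */
struct model_page {
	bool is_slab;		/* models PageSlab() */
	uint64_t mapping_word;	/* word shared by page->mapping and the slab fields */
};

/* Models __PageMovable(): only the two low bits are inspected. */
static bool model_page_movable(const struct model_page *p)
{
	return (p->mapping_word & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE;
}

/* Models the soft-offline branch before the patch. */
static bool model_handlable_before(const struct model_page *p, unsigned long flags)
{
	return (flags & MODEL_MF_SOFT_OFFLINE) && model_page_movable(p);
}

/* Models the patched ordering: slab pages are rejected first. */
static bool model_handlable_after(const struct model_page *p, unsigned long flags)
{
	if (p->is_slab)
		return false;
	return model_handlable_before(p, flags);
}
```

Since 0x122 has low bits 0b10, a slab page whose overlaid word holds LIST_POISON2 passes the movable test in this model, which is the misclassification the commit message describes. Checking the slab flag first sidesteps the aliased word entirely rather than trying to decode it.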