Message ID | 49273e6688d7571756603dac996692a15f245d58.1649603963.git.xuyu@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/memory-failure.c: bail out early if huge zero page | expand |
On 2022/4/10 23:22, Xu Yu wrote: > Kernel panic when injecting memory_failure for the global huge_zero_page, > when CONFIG_DEBUG_VM is enabled, as follows. > > [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 > [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 > [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 > [ 5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) > [ 5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 > [ 5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 > [ 5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) > [ 5.589398] ------------[ cut here ]------------ > [ 5.589952] kernel BUG at mm/huge_memory.c:2499! > [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI > [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 > [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 > [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 > [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b > [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 > [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 > [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff > [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff > [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 > [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 > [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 > [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 > [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 5.604806] Call Trace: > [ 5.605101] <TASK> > [ 5.605357] ? __irq_work_queue_local+0x39/0x70 > [ 5.605904] try_to_split_thp_page+0x3a/0x130 > [ 5.606430] memory_failure+0x128/0x800 > [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 > [ 5.607444] __x64_sys_madvise+0x54/0x60 > [ 5.607915] do_syscall_64+0x35/0x80 > [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae > [ 5.608949] RIP: 0033:0x7fc3754f8bf9 > [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 > [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c > [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 > [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 > [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 > [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 > [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 > [ 5.616626] </TASK> > Thanks for the report and the patch! I remember I and Naoya discussed the try_to_split_thp_page in memory_failure might come across non-lru movable compound page and huge_zero_page. We fixed the non-lru movable compound page case but conclude huge_zero_page won't reach here due to the HWPoisonHandlable() check. But we missed the MF_COUNT_INCREASED case where HWPoisonHandlable() is skipped. > In fact, huge_zero_page is unhandlable currently in either soft offline > or memory failure injection. With CONFIG_DEBUG_VM disabled, > huge_zero_page is bailed out when checking HWPoisonHandlable() in > get_any_page(), or checking page mapping in split_huge_page_to_list(). > > This makes huge_zero_page bail out early in madvise_inject_error(), and > panic above won't happen again. It seems this issue is expected to happen only in madvise_inject_error case because MF_COUNT_INCREASED is only set here. So this fix should do the right thing. But I don't know whether bail out early for huge_zero_page is suitable. Hi Naoya, what do you think? Thanks both! > > Reported-by: Abaci <abaci@linux.alibaba.com> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > --- > mm/madvise.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 1873616a37d2..03ad50d222e0 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1079,7 +1079,7 @@ static int madvise_inject_error(int behavior, > > for (; start < end; start += size) { > unsigned long pfn; > - struct page *page; > + struct page *page, *head; > int ret; > > ret = get_user_pages_fast(start, 1, 0, &page); > @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, > return ret; > pfn = page_to_pfn(page); > > + head = compound_head(page); > + if (unlikely(is_huge_zero_page(head))) { > + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", > + behavior == MADV_SOFT_OFFLINE ? "soft offline" : > + "inject memory failure for", > + pfn, start); > + return -EINVAL; > + } > + > /* > * When soft offlining hugepages, after migrating the page > * we dissolve it, therefore in the second loop "page" will > * no longer be a compound page. > */ > - size = page_size(compound_head(page)); > + size = page_size(head); > > if (behavior == MADV_SOFT_OFFLINE) { > pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", >
On Sun, Apr 10, 2022 at 11:22:34PM +0800, Xu Yu wrote: > Kernel panic when injecting memory_failure for the global huge_zero_page, > when CONFIG_DEBUG_VM is enabled, as follows. ... > In fact, huge_zero_page is unhandlable currently in either soft offline > or memory failure injection. With CONFIG_DEBUG_VM disabled, > huge_zero_page is bailed out when checking HWPoisonHandlable() in > get_any_page(), or checking page mapping in split_huge_page_to_list(). > > This makes huge_zero_page bail out early in madvise_inject_error(), and > panic above won't happen again. I would not special case this in madvise_inject_error() but rather handle it in memory-failure code. We do already have HWPoisonHandlable(), which tells us whether the page is of a type we can really do something about, so why not add another check in HWPoisonHandlable() for huge_zero_page(), and have that checked in memory_failure(). Something like (untested): diff --git a/mm/memory-failure.c b/mm/memory-failure.c index dcb6bb9cf731..dccd0503f803 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1181,6 +1181,10 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags) { bool movable = false; + /* Can't handle huge_zero_page() */ + if(is_huge_zero_page(compound_head(page))) + return false; + /* Soft offline could mirgate non-LRU movable pages */ if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page)) movable = true; @@ -1796,6 +1800,10 @@ int memory_failure(unsigned long pfn, int flags) res = -EBUSY; goto unlock_mutex; } + } else if(!HWPoisonHandable(p)) { + action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); + res = -EBUSY; + goto unlock_mutex; } if (PageTransHuge(hpage)) { It can certainly be prettier, but you can get the idea. > > Reported-by: Abaci <abaci@linux.alibaba.com> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > --- > mm/madvise.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 1873616a37d2..03ad50d222e0 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1079,7 +1079,7 @@ static int madvise_inject_error(int behavior, > > for (; start < end; start += size) { > unsigned long pfn; > - struct page *page; > + struct page *page, *head; > int ret; > > ret = get_user_pages_fast(start, 1, 0, &page); > @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, > return ret; > pfn = page_to_pfn(page); > > + head = compound_head(page); > + if (unlikely(is_huge_zero_page(head))) { > + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", > + behavior == MADV_SOFT_OFFLINE ? "soft offline" : > + "inject memory failure for", > + pfn, start); > + return -EINVAL; > + } > + > /* > * When soft offlining hugepages, after migrating the page > * we dissolve it, therefore in the second loop "page" will > * no longer be a compound page. > */ > - size = page_size(compound_head(page)); > + size = page_size(head); > > if (behavior == MADV_SOFT_OFFLINE) { > pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", > -- > 2.20.1.2432.ga663e714 > >
Hi, Miaohe and Oscar, Thanks for your kind suggestions! Will send v2 after amend and test. On 4/10/22 11:22 PM, Xu Yu wrote: > Kernel panic when injecting memory_failure for the global huge_zero_page, > when CONFIG_DEBUG_VM is enabled, as follows. > > [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 > [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 > [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 > [ 5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) > [ 5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 > [ 5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 > [ 5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) > [ 5.589398] ------------[ cut here ]------------ > [ 5.589952] kernel BUG at mm/huge_memory.c:2499! > [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI > [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 > [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 > [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 > [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b > [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 > [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 > [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff > [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff > [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 > [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 > [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 > [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 > [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 5.604806] Call Trace: > [ 5.605101] <TASK> > [ 5.605357] ? __irq_work_queue_local+0x39/0x70 > [ 5.605904] try_to_split_thp_page+0x3a/0x130 > [ 5.606430] memory_failure+0x128/0x800 > [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 > [ 5.607444] __x64_sys_madvise+0x54/0x60 > [ 5.607915] do_syscall_64+0x35/0x80 > [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae > [ 5.608949] RIP: 0033:0x7fc3754f8bf9 > [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 > [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c > [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 > [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 > [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 > [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 > [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 > [ 5.616626] </TASK> > > In fact, huge_zero_page is unhandlable currently in either soft offline > or memory failure injection. With CONFIG_DEBUG_VM disabled, > huge_zero_page is bailed out when checking HWPoisonHandlable() in > get_any_page(), or checking page mapping in split_huge_page_to_list(). > > This makes huge_zero_page bail out early in madvise_inject_error(), and > panic above won't happen again. > > Reported-by: Abaci <abaci@linux.alibaba.com> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> > --- > mm/madvise.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 1873616a37d2..03ad50d222e0 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1079,7 +1079,7 @@ static int madvise_inject_error(int behavior, > > for (; start < end; start += size) { > unsigned long pfn; > - struct page *page; > + struct page *page, *head; > int ret; > > ret = get_user_pages_fast(start, 1, 0, &page); > @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, > return ret; > pfn = page_to_pfn(page); > > + head = compound_head(page); > + if (unlikely(is_huge_zero_page(head))) { > + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", > + behavior == MADV_SOFT_OFFLINE ? "soft offline" : > + "inject memory failure for", > + pfn, start); > + return -EINVAL; > + } > + > /* > * When soft offlining hugepages, after migrating the page > * we dissolve it, therefore in the second loop "page" will > * no longer be a compound page. > */ > - size = page_size(compound_head(page)); > + size = page_size(head); > > if (behavior == MADV_SOFT_OFFLINE) { > pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
On Mon, Apr 11, 2022 at 10:18:26AM +0800, Miaohe Lin wrote: > On 2022/4/10 23:22, Xu Yu wrote: > > Kernel panic when injecting memory_failure for the global huge_zero_page, > > when CONFIG_DEBUG_VM is enabled, as follows. > > > > [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 > > [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 > > [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 > > [ 5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) > > [ 5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 > > [ 5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 > > [ 5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) > > [ 5.589398] ------------[ cut here ]------------ > > [ 5.589952] kernel BUG at mm/huge_memory.c:2499! > > [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI > > [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 > > [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 > > [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 > > [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b > > [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 > > [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 > > [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff > > [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff > > [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 > > [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 > > [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 > > [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 > > [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [ 5.604806] Call Trace: > > [ 5.605101] <TASK> > > [ 5.605357] ? __irq_work_queue_local+0x39/0x70 > > [ 5.605904] try_to_split_thp_page+0x3a/0x130 > > [ 5.606430] memory_failure+0x128/0x800 > > [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 > > [ 5.607444] __x64_sys_madvise+0x54/0x60 > > [ 5.607915] do_syscall_64+0x35/0x80 > > [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae > > [ 5.608949] RIP: 0033:0x7fc3754f8bf9 > > [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 > > [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c > > [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 > > [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 > > [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 > > [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 > > [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 > > [ 5.616626] </TASK> > > > > Thanks for the report and the patch! > > I remember I and Naoya discussed the try_to_split_thp_page in memory_failure might come > across non-lru movable compound page and huge_zero_page. We fixed the non-lru movable > compound page case but conclude huge_zero_page won't reach here due to the HWPoisonHandlable() > check. But we missed the MF_COUNT_INCREASED case where HWPoisonHandlable() is skipped. > > > In fact, huge_zero_page is unhandlable currently in either soft offline > > or memory failure injection. With CONFIG_DEBUG_VM disabled, > > huge_zero_page is bailed out when checking HWPoisonHandlable() in > > get_any_page(), or checking page mapping in split_huge_page_to_list(). > > > > This makes huge_zero_page bail out early in madvise_inject_error(), and > > panic above won't happen again. > > It seems this issue is expected to happen only in madvise_inject_error case because > MF_COUNT_INCREASED is only set here. So this fix should do the right thing. But I > don't know whether bail out early for huge_zero_page is suitable. > > Hi Naoya, what do you think? Thank you for reporting. ... > > @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, > > return ret; > > pfn = page_to_pfn(page); > > > > + head = compound_head(page); > > + if (unlikely(is_huge_zero_page(head))) { > > + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", > > + behavior == MADV_SOFT_OFFLINE ? "soft offline" : > > + "inject memory failure for", > > + pfn, start); > > + return -EINVAL; > > + } This check is about the detail of error handling, so I feel it desirable to do this in memory_failure(). And memory errors on huge zero page is the real scenario, so it seems to me better to make this case injectable rather than EINVAL. How about checking is_huge_zero_page() before try_to_split_thp_page()? The result should be consistent with the results when called by other memory_failure()'s callers like MCE handler and hard_offline_page_store(). diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 9b76222ee237..771fb4fc626c 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) } if (PageTransHuge(hpage)) { + if (is_huge_zero_page(hpage)) { + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); + res = -EBUSY; + goto unlock_mutex; + } + /* * The flag must be set after the refcount is bumped * otherwise it may race with THP split. Thanks, Naoya Horiguchi
On 2022/4/12 16:31, Oscar Salvador wrote: > On Sun, Apr 10, 2022 at 11:22:34PM +0800, Xu Yu wrote: >> Kernel panic when injecting memory_failure for the global huge_zero_page, >> when CONFIG_DEBUG_VM is enabled, as follows. > ... >> In fact, huge_zero_page is unhandlable currently in either soft offline >> or memory failure injection. With CONFIG_DEBUG_VM disabled, >> huge_zero_page is bailed out when checking HWPoisonHandlable() in >> get_any_page(), or checking page mapping in split_huge_page_to_list(). >> >> This makes huge_zero_page bail out early in madvise_inject_error(), and >> panic above won't happen again. > > I would not special case this in madvise_inject_error() but rather > handle it in memory-failure code. > We do already have HWPoisonHandlable(), which tells us whether the page > is of a type we can really do something about, so why not add another > check in HWPoisonHandlable() for huge_zero_page(), and have that checked > in memory_failure(). IIUC, this does not work. Because HWPoisonHandlable is only called in !MF_COUNT_INCREASED case. But MF_COUNT_INCREASED is always set when called from madvise_inject_error, so HWPoisonHandlable is not even called in this scene. Or am I miss something? BTW: IIRC, LRU isn't set on huge_zero_page. So the origin HWPoisonHandlable can already filter out this page. Thanks! > Something like (untested): > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index dcb6bb9cf731..dccd0503f803 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1181,6 +1181,10 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags) > { > bool movable = false; > > + /* Can't handle huge_zero_page() */ > + if(is_huge_zero_page(compound_head(page))) > + return false; > + > /* Soft offline could mirgate non-LRU movable pages */ > if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page)) > movable = true; > @@ -1796,6 +1800,10 @@ int memory_failure(unsigned long pfn, int flags) > res = -EBUSY; > goto unlock_mutex; > } > + } else if(!HWPoisonHandable(p)) { > + action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); > + res = -EBUSY; > + goto unlock_mutex; > } > > if (PageTransHuge(hpage)) { > > It can certainly be prettier, but you can get the idea. > > >> >> Reported-by: Abaci <abaci@linux.alibaba.com> >> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> >> --- >> mm/madvise.c | 13 +++++++++++-- >> 1 file changed, 11 insertions(+), 2 deletions(-) >> >> diff --git a/mm/madvise.c b/mm/madvise.c >> index 1873616a37d2..03ad50d222e0 100644 >> --- a/mm/madvise.c >> +++ b/mm/madvise.c >> @@ -1079,7 +1079,7 @@ static int madvise_inject_error(int behavior, >> >> for (; start < end; start += size) { >> unsigned long pfn; >> - struct page *page; >> + struct page *page, *head; >> int ret; >> >> ret = get_user_pages_fast(start, 1, 0, &page); >> @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, >> return ret; >> pfn = page_to_pfn(page); >> >> + head = compound_head(page); >> + if (unlikely(is_huge_zero_page(head))) { >> + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", >> + behavior == MADV_SOFT_OFFLINE ? "soft offline" : >> + "inject memory failure for", >> + pfn, start); >> + return -EINVAL; >> + } >> + >> /* >> * When soft offlining hugepages, after migrating the page >> * we dissolve it, therefore in the second loop "page" will >> * no longer be a compound page. >> */ >> - size = page_size(compound_head(page)); >> + size = page_size(head); >> >> if (behavior == MADV_SOFT_OFFLINE) { >> pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", >> -- >> 2.20.1.2432.ga663e714 >> >> >
On Tue, Apr 12, 2022 at 05:25:52PM +0800, Miaohe Lin wrote: > On 2022/4/12 16:31, Oscar Salvador wrote: > > On Sun, Apr 10, 2022 at 11:22:34PM +0800, Xu Yu wrote: > >> Kernel panic when injecting memory_failure for the global huge_zero_page, > >> when CONFIG_DEBUG_VM is enabled, as follows. > > ... > >> In fact, huge_zero_page is unhandlable currently in either soft offline > >> or memory failure injection. With CONFIG_DEBUG_VM disabled, > >> huge_zero_page is bailed out when checking HWPoisonHandlable() in > >> get_any_page(), or checking page mapping in split_huge_page_to_list(). > >> > >> This makes huge_zero_page bail out early in madvise_inject_error(), and > >> panic above won't happen again. > > > > I would not special case this in madvise_inject_error() but rather > > handle it in memory-failure code. > > We do already have HWPoisonHandlable(), which tells us whether the page > > is of a type we can really do something about, so why not add another > > check in HWPoisonHandlable() for huge_zero_page(), and have that checked > > in memory_failure(). > > IIUC, this does not work. Because HWPoisonHandlable is only called in !MF_COUNT_INCREASED case. > But MF_COUNT_INCREASED is always set when called from madvise_inject_error, so HWPoisonHandlable > is not even called in this scene. Or am I miss something? But nothing stops you from calling it in memory_failure(), right? if (MF_COUNT_INCREASED not set) { .... ... } else if(!HWPoisonHandable(p)) { action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); res = -EBUSY; goto unlock_mutex; } > BTW: IIRC, LRU isn't set on huge_zero_page. So the origin HWPoisonHandlable can already filter out this page. I would rather have it as a explicit check than buried in that kind of assumption. But after all, Naoya's suggestion might just be better and more focused.
On 4/12/22 5:09 PM, Naoya Horiguchi wrote: > On Mon, Apr 11, 2022 at 10:18:26AM +0800, Miaohe Lin wrote: >> On 2022/4/10 23:22, Xu Yu wrote: >>> Kernel panic when injecting memory_failure for the global huge_zero_page, >>> when CONFIG_DEBUG_VM is enabled, as follows. >>> >>> [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 >>> [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 >>> [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 >>> [ 5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) >>> [ 5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 >>> [ 5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 >>> [ 5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) >>> [ 5.589398] ------------[ cut here ]------------ >>> [ 5.589952] kernel BUG at mm/huge_memory.c:2499! >>> [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 >>> [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 >>> [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 >>> [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b >>> [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 >>> [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 >>> [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff >>> [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff >>> [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 >>> [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 >>> [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 >>> [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 >>> [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> [ 5.604806] Call Trace: >>> [ 5.605101] <TASK> >>> [ 5.605357] ? __irq_work_queue_local+0x39/0x70 >>> [ 5.605904] try_to_split_thp_page+0x3a/0x130 >>> [ 5.606430] memory_failure+0x128/0x800 >>> [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 >>> [ 5.607444] __x64_sys_madvise+0x54/0x60 >>> [ 5.607915] do_syscall_64+0x35/0x80 >>> [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae >>> [ 5.608949] RIP: 0033:0x7fc3754f8bf9 >>> [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 >>> [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c >>> [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 >>> [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 >>> [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 >>> [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 >>> [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 >>> [ 5.616626] </TASK> >>> >> >> Thanks for the report and the patch! >> >> I remember I and Naoya discussed the try_to_split_thp_page in memory_failure might come >> across non-lru movable compound page and huge_zero_page. We fixed the non-lru movable >> compound page case but conclude huge_zero_page won't reach here due to the HWPoisonHandlable() >> check. But we missed the MF_COUNT_INCREASED case where HWPoisonHandlable() is skipped. >> >>> In fact, huge_zero_page is unhandlable currently in either soft offline >>> or memory failure injection. With CONFIG_DEBUG_VM disabled, >>> huge_zero_page is bailed out when checking HWPoisonHandlable() in >>> get_any_page(), or checking page mapping in split_huge_page_to_list(). >>> >>> This makes huge_zero_page bail out early in madvise_inject_error(), and >>> panic above won't happen again. >> >> It seems this issue is expected to happen only in madvise_inject_error case because >> MF_COUNT_INCREASED is only set here. So this fix should do the right thing. But I >> don't know whether bail out early for huge_zero_page is suitable. >> >> Hi Naoya, what do you think? > > Thank you for reporting. > > ... > >>> @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, >>> return ret; >>> pfn = page_to_pfn(page); >>> >>> + head = compound_head(page); >>> + if (unlikely(is_huge_zero_page(head))) { >>> + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", >>> + behavior == MADV_SOFT_OFFLINE ? "soft offline" : >>> + "inject memory failure for", >>> + pfn, start); >>> + return -EINVAL; >>> + } > > This check is about the detail of error handling, so I feel it desirable to > do this in memory_failure(). And memory errors on huge zero page is the > real scenario, so it seems to me better to make this case injectable rather > than EINVAL. > > How about checking is_huge_zero_page() before try_to_split_thp_page()? > The result should be consistent with the results when called by other > memory_failure()'s callers like MCE handler and hard_offline_page_store(). Agree. thanks! > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 9b76222ee237..771fb4fc626c 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) > } > > if (PageTransHuge(hpage)) { > + if (is_huge_zero_page(hpage)) { > + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); Should we use MF_MSG_UNSPLIT_THP instead of MF_MSG_KERNEL_HIGH_ORDER? And should we SetPageHasHWPoisoned(hpage) for huge zero page, since TestSetPageHWPoison(p) is done in the early part of memory_failure(). > + res = -EBUSY; > + goto unlock_mutex; > + } > + > /* > * The flag must be set after the refcount is bumped > * otherwise it may race with THP split. > > > Thanks, > Naoya Horiguchi
On 4/12/22 5:30 PM, Oscar Salvador wrote: > On Tue, Apr 12, 2022 at 05:25:52PM +0800, Miaohe Lin wrote: >> On 2022/4/12 16:31, Oscar Salvador wrote: >>> On Sun, Apr 10, 2022 at 11:22:34PM +0800, Xu Yu wrote: >>>> Kernel panic when injecting memory_failure for the global huge_zero_page, >>>> when CONFIG_DEBUG_VM is enabled, as follows. >>> ... >>>> In fact, huge_zero_page is unhandlable currently in either soft offline >>>> or memory failure injection. With CONFIG_DEBUG_VM disabled, >>>> huge_zero_page is bailed out when checking HWPoisonHandlable() in >>>> get_any_page(), or checking page mapping in split_huge_page_to_list(). >>>> >>>> This makes huge_zero_page bail out early in madvise_inject_error(), and >>>> panic above won't happen again. >>> >>> I would not special case this in madvise_inject_error() but rather >>> handle it in memory-failure code. >>> We do already have HWPoisonHandlable(), which tells us whether the page >>> is of a type we can really do something about, so why not add another >>> check in HWPoisonHandlable() for huge_zero_page(), and have that checked >>> in memory_failure(). >> >> IIUC, this does not work. Because HWPoisonHandlable is only called in !MF_COUNT_INCREASED case. >> But MF_COUNT_INCREASED is always set when called from madvise_inject_error, so HWPoisonHandlable >> is not even called in this scene. Or am I miss something? > > But nothing stops you from calling it in memory_failure(), right? > > if (MF_COUNT_INCREASED not set) { > .... > ... > } else if(!HWPoisonHandable(p)) { > action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); > res = -EBUSY; > goto unlock_mutex; > } > >> BTW: IIRC, LRU isn't set on huge_zero_page. So the origin HWPoisonHandlable can already filter out this page. > > I would rather have it as a explicit check than buried in that kind of > assumption. And this is also why I decided to bail out huge zero page early for both soft offline and memory failure. > > But after all, Naoya's suggestion might just be better and more focused. Agree, thanks! > >
On 4/12/22 5:45 PM, Yu Xu wrote: > On 4/12/22 5:09 PM, Naoya Horiguchi wrote: >> On Mon, Apr 11, 2022 at 10:18:26AM +0800, Miaohe Lin wrote: >>> On 2022/4/10 23:22, Xu Yu wrote: >>>> Kernel panic when injecting memory_failure for the global >>>> huge_zero_page, >>>> when CONFIG_DEBUG_VM is enabled, as follows. >>>> >>>> [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process >>>> virtual address 0x20ff9000 >>>> [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 >>>> mapping:0000000000000000 index:0x0 pfn:0x109e00 >>>> [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 >>>> compound_pincount:0 >>>> [ 5.585796] flags: >>>> 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) >>>> [ 5.586712] raw: 017fffc000010001 0000000000000000 >>>> dead000000000122 0000000000000000 >>>> [ 5.587640] raw: 0000000000000000 0000000000000000 >>>> 00000002ffffffff 0000000000000000 >>>> [ 5.588565] page dumped because: >>>> VM_BUG_ON_PAGE(is_huge_zero_page(head)) >>>> [ 5.589398] ------------[ cut here ]------------ >>>> [ 5.589952] kernel BUG at mm/huge_memory.c:2499! >>>> [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI >>>> [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted >>>> 5.18.0-rc1+ #11 >>>> [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS >>>> 3288b3c 04/01/2014 >>>> [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 >>>> [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a >>>> 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b >>>> [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 >>>> [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: >>>> 0000000000000000 >>>> [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: >>>> 00000000ffffffff >>>> [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: >>>> 00000000fffeffff >>>> [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: >>>> ffffea0004278000 >>>> [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: >>>> ffffea000427fe40 >>>> [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) >>>> knlGS:0000000000000000 >>>> [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: >>>> 00000000003706e0 >>>> [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>> 0000000000000000 >>>> [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: >>>> 0000000000000400 >>>> [ 5.604806] Call Trace: >>>> [ 5.605101] <TASK> >>>> [ 5.605357] ? __irq_work_queue_local+0x39/0x70 >>>> [ 5.605904] try_to_split_thp_page+0x3a/0x130 >>>> [ 5.606430] memory_failure+0x128/0x800 >>>> [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 >>>> [ 5.607444] __x64_sys_madvise+0x54/0x60 >>>> [ 5.607915] do_syscall_64+0x35/0x80 >>>> [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae >>>> [ 5.608949] RIP: 0033:0x7fc3754f8bf9 >>>> [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f >>>> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 >>>> [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: >>>> 000000000000001c >>>> [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: >>>> 00007fc3754f8bf9 >>>> [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: >>>> 0000000020ff9000 >>>> [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: >>>> 0000000000000000 >>>> [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: >>>> 0000000000400490 >>>> [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: >>>> 0000000000000000 >>>> [ 5.616626] </TASK> >>>> >>> >>> Thanks for the report and the patch! >>> >>> I remember I and Naoya discussed the try_to_split_thp_page in >>> memory_failure might come >>> across non-lru movable compound page and huge_zero_page. We fixed the >>> non-lru movable >>> compound page case but conclude huge_zero_page won't reach here due >>> to the HWPoisonHandlable() >>> check. But we missed the MF_COUNT_INCREASED case where >>> HWPoisonHandlable() is skipped. >>> >>>> In fact, huge_zero_page is unhandlable currently in either soft offline >>>> or memory failure injection. With CONFIG_DEBUG_VM disabled, >>>> huge_zero_page is bailed out when checking HWPoisonHandlable() in >>>> get_any_page(), or checking page mapping in split_huge_page_to_list(). >>>> >>>> This makes huge_zero_page bail out early in madvise_inject_error(), and >>>> panic above won't happen again. >>> >>> It seems this issue is expected to happen only in >>> madvise_inject_error case because >>> MF_COUNT_INCREASED is only set here. So this fix should do the right >>> thing. But I >>> don't know whether bail out early for huge_zero_page is suitable. >>> >>> Hi Naoya, what do you think? >> >> Thank you for reporting. >> >> ... >> >>>> @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, >>>> return ret; >>>> pfn = page_to_pfn(page); >>>> + head = compound_head(page); >>>> + if (unlikely(is_huge_zero_page(head))) { >>>> + pr_warn("Unhandlable attempt to %s pfn %#lx at process >>>> virtual address %#lx\n", >>>> + behavior == MADV_SOFT_OFFLINE ? "soft offline" : >>>> + "inject memory failure for", >>>> + pfn, start); >>>> + return -EINVAL; >>>> + } >> >> This check is about the detail of error handling, so I feel it >> desirable to >> do this in memory_failure(). And memory errors on huge zero page is the >> real scenario, so it seems to me better to make this case injectable >> rather >> than EINVAL. >> >> How about checking is_huge_zero_page() before try_to_split_thp_page()? >> The result should be consistent with the results when called by other >> memory_failure()'s callers like MCE handler and >> hard_offline_page_store(). > > Agree. thanks! > >> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >> index 9b76222ee237..771fb4fc626c 100644 >> --- a/mm/memory-failure.c >> +++ b/mm/memory-failure.c >> @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) >> } >> if (PageTransHuge(hpage)) { >> + if (is_huge_zero_page(hpage)) { >> + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); > > Should we use MF_MSG_UNSPLIT_THP instead of MF_MSG_KERNEL_HIGH_ORDER? > > And should we SetPageHasHWPoisoned(hpage) for huge zero page, since > TestSetPageHWPoison(p) is done in the early part of memory_failure(). If so, we just need to add a one-line condition in try_to_split_thp_page(). > >> + res = -EBUSY; >> + goto unlock_mutex; >> + } >> + >> /* >> * The flag must be set after the refcount is bumped >> * otherwise it may race with THP split. >> >> >> Thanks, >> Naoya Horiguchi >
On 2022/4/12 17:30, Oscar Salvador wrote: > On Tue, Apr 12, 2022 at 05:25:52PM +0800, Miaohe Lin wrote: >> On 2022/4/12 16:31, Oscar Salvador wrote: >>> On Sun, Apr 10, 2022 at 11:22:34PM +0800, Xu Yu wrote: >>>> Kernel panic when injecting memory_failure for the global huge_zero_page, >>>> when CONFIG_DEBUG_VM is enabled, as follows. >>> ... >>>> In fact, huge_zero_page is unhandlable currently in either soft offline >>>> or memory failure injection. With CONFIG_DEBUG_VM disabled, >>>> huge_zero_page is bailed out when checking HWPoisonHandlable() in >>>> get_any_page(), or checking page mapping in split_huge_page_to_list(). >>>> >>>> This makes huge_zero_page bail out early in madvise_inject_error(), and >>>> panic above won't happen again. >>> >>> I would not special case this in madvise_inject_error() but rather >>> handle it in memory-failure code. >>> We do already have HWPoisonHandlable(), which tells us whether the page >>> is of a type we can really do something about, so why not add another >>> check in HWPoisonHandlable() for huge_zero_page(), and have that checked >>> in memory_failure(). >> >> IIUC, this does not work. Because HWPoisonHandlable is only called in !MF_COUNT_INCREASED case. >> But MF_COUNT_INCREASED is always set when called from madvise_inject_error, so HWPoisonHandlable >> is not even called in this scene. Or am I miss something? > > But nothing stops you from calling it in memory_failure(), right? > > if (MF_COUNT_INCREASED not set) { > .... > ... > } else if(!HWPoisonHandable(p)) { > action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); > res = -EBUSY; > goto unlock_mutex; > } Yes, I somewhat misread the proposed code. Thanks for clarifying. :) > >> BTW: IIRC, LRU isn't set on huge_zero_page. So the origin HWPoisonHandlable can already filter out this page. > > I would rather have it as a explicit check than buried in that kind of > assumption. > > But after all, Naoya's suggestion might just be better and more focused. > >
On 2022/4/12 17:09, Naoya Horiguchi wrote: > On Mon, Apr 11, 2022 at 10:18:26AM +0800, Miaohe Lin wrote: >> On 2022/4/10 23:22, Xu Yu wrote: >>> Kernel panic when injecting memory_failure for the global huge_zero_page, >>> when CONFIG_DEBUG_VM is enabled, as follows. >>> >>> [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 >>> [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 >>> [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 >>> [ 5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) >>> [ 5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 >>> [ 5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 >>> [ 5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) >>> [ 5.589398] ------------[ cut here ]------------ >>> [ 5.589952] kernel BUG at mm/huge_memory.c:2499! >>> [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 >>> [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 >>> [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 >>> [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b >>> [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 >>> [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 >>> [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff >>> [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff >>> [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 >>> [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 >>> [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 >>> [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 >>> [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> [ 5.604806] Call Trace: >>> [ 5.605101] <TASK> >>> [ 5.605357] ? __irq_work_queue_local+0x39/0x70 >>> [ 5.605904] try_to_split_thp_page+0x3a/0x130 >>> [ 5.606430] memory_failure+0x128/0x800 >>> [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 >>> [ 5.607444] __x64_sys_madvise+0x54/0x60 >>> [ 5.607915] do_syscall_64+0x35/0x80 >>> [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae >>> [ 5.608949] RIP: 0033:0x7fc3754f8bf9 >>> [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 >>> [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c >>> [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 >>> [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 >>> [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 >>> [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 >>> [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 >>> [ 5.616626] </TASK> >>> >> >> Thanks for the report and the patch! >> >> I remember I and Naoya discussed the try_to_split_thp_page in memory_failure might come >> across non-lru movable compound page and huge_zero_page. We fixed the non-lru movable >> compound page case but conclude huge_zero_page won't reach here due to the HWPoisonHandlable() >> check. But we missed the MF_COUNT_INCREASED case where HWPoisonHandlable() is skipped. >> >>> In fact, huge_zero_page is unhandlable currently in either soft offline >>> or memory failure injection. With CONFIG_DEBUG_VM disabled, >>> huge_zero_page is bailed out when checking HWPoisonHandlable() in >>> get_any_page(), or checking page mapping in split_huge_page_to_list(). >>> >>> This makes huge_zero_page bail out early in madvise_inject_error(), and >>> panic above won't happen again. >> >> It seems this issue is expected to happen only in madvise_inject_error case because >> MF_COUNT_INCREASED is only set here. So this fix should do the right thing. But I >> don't know whether bail out early for huge_zero_page is suitable. >> >> Hi Naoya, what do you think? > > Thank you for reporting. > > ... > >>> @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, >>> return ret; >>> pfn = page_to_pfn(page); >>> >>> + head = compound_head(page); >>> + if (unlikely(is_huge_zero_page(head))) { >>> + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", >>> + behavior == MADV_SOFT_OFFLINE ? "soft offline" : >>> + "inject memory failure for", >>> + pfn, start); >>> + return -EINVAL; >>> + } > > This check is about the detail of error handling, so I feel it desirable to > do this in memory_failure(). And memory errors on huge zero page is the > real scenario, so it seems to me better to make this case injectable rather > than EINVAL. > > How about checking is_huge_zero_page() before try_to_split_thp_page()? > The result should be consistent with the results when called by other > memory_failure()'s callers like MCE handler and hard_offline_page_store(). Yes, they should be same except HWPoisonHandable isn't called for callers like MCE handler and hard_offline_page_store(). > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 9b76222ee237..771fb4fc626c 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) > } > > if (PageTransHuge(hpage)) { > + if (is_huge_zero_page(hpage)) { > + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); > + res = -EBUSY; > + goto unlock_mutex; > + } > + It seems that huge_zero_page could be handled simply by zap the corresponding page table without loss any user data. Should we also try to handle this kind of page? Or just bail out as it's rare? Thanks! > /* > * The flag must be set after the refcount is bumped > * otherwise it may race with THP split. > > > Thanks, > Naoya Horiguchi > . >
On Tue, Apr 12, 2022 at 06:00:09PM +0800, Yu Xu wrote: > On 4/12/22 5:45 PM, Yu Xu wrote: ... > > > > > @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, > > > > > return ret; > > > > > pfn = page_to_pfn(page); > > > > > + head = compound_head(page); > > > > > + if (unlikely(is_huge_zero_page(head))) { > > > > > + pr_warn("Unhandlable attempt to %s pfn %#lx at > > > > > process virtual address %#lx\n", > > > > > + behavior == MADV_SOFT_OFFLINE ? "soft offline" : > > > > > + "inject memory failure for", > > > > > + pfn, start); > > > > > + return -EINVAL; > > > > > + } > > > > > > This check is about the detail of error handling, so I feel it > > > desirable to > > > do this in memory_failure(). And memory errors on huge zero page is the > > > real scenario, so it seems to me better to make this case injectable > > > rather > > > than EINVAL. > > > > > > How about checking is_huge_zero_page() before try_to_split_thp_page()? > > > The result should be consistent with the results when called by other > > > memory_failure()'s callers like MCE handler and > > > hard_offline_page_store(). > > > > Agree. thanks! > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > index 9b76222ee237..771fb4fc626c 100644 > > > --- a/mm/memory-failure.c > > > +++ b/mm/memory-failure.c > > > @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) > > > } > > > if (PageTransHuge(hpage)) { > > > + if (is_huge_zero_page(hpage)) { > > > + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); > > > > Should we use MF_MSG_UNSPLIT_THP instead of MF_MSG_KERNEL_HIGH_ORDER? > > > > And should we SetPageHasHWPoisoned(hpage) for huge zero page, since > > TestSetPageHWPoison(p) is done in the early part of memory_failure(). Yeah, these could be possible. I suggested them to get the same result regardless of calling interfaces of memory_failure(). How huge_zero pages should be handled is separate from the reported issue on VM_BUG_ON_PAGE(), so it would require a separate patch (which updates MF_COUNT_INCREASED=true case too). Thanks, Naoya Horiguchi > > If so, we just need to add a one-line condition in > try_to_split_thp_page().
On Tue, Apr 12, 2022 at 07:08:45PM +0800, Miaohe Lin wrote: ... > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 9b76222ee237..771fb4fc626c 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) > > } > > > > if (PageTransHuge(hpage)) { > > + if (is_huge_zero_page(hpage)) { > > + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); > > + res = -EBUSY; > > + goto unlock_mutex; > > + } > > + > > It seems that huge_zero_page could be handled simply by zap the corresponding page table without > loss any user data. Yes, zapping all page table entries to huge_zero_page is OK, and I think that maybe huge_zero_page should be set to NULL. The broken huge_zero page has no user data, but could have corrupted data (with unexpected non-zero bits), so it's safer to replace with new zero pages. And get_huge_zero_page() seems to allocate a new huge zero page if huge_zero_page is NULL when called, so it would be gracefully switched to new one on the first later access. > Should we also try to handle this kind of page? Or just bail out as it's rare? We should handle it if it's worth doing. I think that memory errors on zero pages might be rare events (because they occupy small portion of physicall memory). But if zero pages could be used by many process, the impact of the error might be non-negligible. Thanks, Naoya Horiguchi
On 2022/4/13 16:36, HORIGUCHI NAOYA(堀口 直也) wrote: > On Tue, Apr 12, 2022 at 07:08:45PM +0800, Miaohe Lin wrote: > ... >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>> index 9b76222ee237..771fb4fc626c 100644 >>> --- a/mm/memory-failure.c >>> +++ b/mm/memory-failure.c >>> @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags) >>> } >>> >>> if (PageTransHuge(hpage)) { >>> + if (is_huge_zero_page(hpage)) { >>> + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); >>> + res = -EBUSY; >>> + goto unlock_mutex; >>> + } >>> + >> >> It seems that huge_zero_page could be handled simply by zap the corresponding page table without >> loss any user data. > > Yes, zapping all page table entries to huge_zero_page is OK, and I think > that maybe huge_zero_page should be set to NULL. The broken huge_zero page > has no user data, but could have corrupted data (with unexpected non-zero > bits), so it's safer to replace with new zero pages. And > get_huge_zero_page() seems to allocate a new huge zero page if > huge_zero_page is NULL when called, so it would be gracefully switched > to new one on the first later access. Agree. > >> Should we also try to handle this kind of page? Or just bail out as it's rare? > > We should handle it if it's worth doing. I think that memory errors on zero > pages might be rare events (because they occupy small portion of physicall > memory). But if zero pages could be used by many process, the impact of the > error might be non-negligible. Yes, when this becomes non-negligible, we could handle it. :) Thanks. > > Thanks, > Naoya Horiguchi >
diff --git a/mm/madvise.c b/mm/madvise.c index 1873616a37d2..03ad50d222e0 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1079,7 +1079,7 @@ static int madvise_inject_error(int behavior, for (; start < end; start += size) { unsigned long pfn; - struct page *page; + struct page *page, *head; int ret; ret = get_user_pages_fast(start, 1, 0, &page); @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior, return ret; pfn = page_to_pfn(page); + head = compound_head(page); + if (unlikely(is_huge_zero_page(head))) { + pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n", + behavior == MADV_SOFT_OFFLINE ? "soft offline" : + "inject memory failure for", + pfn, start); + return -EINVAL; + } + /* * When soft offlining hugepages, after migrating the page * we dissolve it, therefore in the second loop "page" will * no longer be a compound page. */ - size = page_size(compound_head(page)); + size = page_size(head); if (behavior == MADV_SOFT_OFFLINE) { pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
Kernel panic when injecting memory_failure for the global huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows. [ 5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 [ 5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 [ 5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 [ 5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) [ 5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 [ 5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 [ 5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) [ 5.589398] ------------[ cut here ]------------ [ 5.589952] kernel BUG at mm/huge_memory.c:2499! [ 5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 [ 5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 [ 5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880 [ 5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b [ 5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 [ 5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 [ 5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff [ 5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff [ 5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 [ 5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 [ 5.600693] FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 [ 5.601640] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 [ 5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5.604806] Call Trace: [ 5.605101] <TASK> [ 5.605357] ? __irq_work_queue_local+0x39/0x70 [ 5.605904] try_to_split_thp_page+0x3a/0x130 [ 5.606430] memory_failure+0x128/0x800 [ 5.606888] madvise_inject_error.cold+0x8b/0xa1 [ 5.607444] __x64_sys_madvise+0x54/0x60 [ 5.607915] do_syscall_64+0x35/0x80 [ 5.608347] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 5.608949] RIP: 0033:0x7fc3754f8bf9 [ 5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 [ 5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c [ 5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 [ 5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 [ 5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 [ 5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 [ 5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 [ 5.616626] </TASK> In fact, huge_zero_page is unhandlable currently in either soft offline or memory failure injection. With CONFIG_DEBUG_VM disabled, huge_zero_page is bailed out when checking HWPoisonHandlable() in get_any_page(), or checking page mapping in split_huge_page_to_list(). This makes huge_zero_page bail out early in madvise_inject_error(), and panic above won't happen again. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> --- mm/madvise.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-)