Message ID | 20250324131750.1551884-1-tujinjiang@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range | expand |
On 24.03.25 14:17, Jinjiang Tu wrote: > We triggered the below BUG: > > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402 > head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) > page_type: f4(hugetlb) > page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) > ------------[ cut here ]------------ > kernel BUG at ./include/linux/page-flags.h:310! > Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > Modules linked in: > CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 > Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 > pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : const_folio_flags+0x3c/0x58 > lr : const_folio_flags+0x3c/0x58 > Call trace: > const_folio_flags+0x3c/0x58 (P) > do_migrate_range+0x164/0x720 > offline_pages+0x63c/0x6fc > memory_subsys_offline+0x190/0x1f4 > device_offline+0xc0/0x13c > state_store+0x90/0xd8 > dev_attr_store+0x18/0x2c > sysfs_kf_write+0x44/0x54 > kernfs_fop_write_iter+0x120/0x1cc > vfs_write+0x240/0x378 > ksys_write+0x70/0x108 > __arm64_sys_write+0x1c/0x28 > invoke_syscall+0x48/0x10c > el0_svc_common.constprop.0+0x40/0xe0 > > When allocating a hugetlb folio, between the folio is taken from buddy > and prep_compound_page() is called, start_isolate_page_range() and > do_migrate_range() is called. When do_migrate_range() scans the head page > of the hugetlb folio, the compound_head field isn't set, so scans the > tail page next. And at this time, the compound_head field of tail page is > set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON(). > > To fix it, get folio refcount before calling folio_test_large(). > > Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") > Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> > --- > mm/memory_hotplug.c | 12 +++--------- > 1 file changed, 3 insertions(+), 9 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 16cf9e17077e..f600c26ce5de 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) > page = pfn_to_page(pfn); > folio = page_folio(page); > > - /* > - * No reference or lock is held on the folio, so it might > - * be modified concurrently (e.g. split). As such, > - * folio_nr_pages() may read garbage. This is fine as the outer > - * loop will revisit the split folio later. > - */ > - if (folio_test_large(folio)) > - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; > - > if (!folio_try_get(folio)) > continue; > > if (unlikely(page_folio(page) != folio)) > goto put_folio; > > + if (folio_test_large(folio)) > + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; Moving that will not make it able to skip the large frozen (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case above. Hmmmm .. We could similarly to dumping folios, snapshot them, so we can read stable data.
在 2025/3/24 21:44, David Hildenbrand 写道: > On 24.03.25 14:17, Jinjiang Tu wrote: >> We triggered the below BUG: >> >> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 >> pfn:0x240402 >> head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 >> pincount:0 >> flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) >> page_type: f4(hugetlb) >> page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) >> ------------[ cut here ]------------ >> kernel BUG at ./include/linux/page-flags.h:310! >> Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP >> Modules linked in: >> CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 >> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >> pc : const_folio_flags+0x3c/0x58 >> lr : const_folio_flags+0x3c/0x58 >> Call trace: >> const_folio_flags+0x3c/0x58 (P) >> do_migrate_range+0x164/0x720 >> offline_pages+0x63c/0x6fc >> memory_subsys_offline+0x190/0x1f4 >> device_offline+0xc0/0x13c >> state_store+0x90/0xd8 >> dev_attr_store+0x18/0x2c >> sysfs_kf_write+0x44/0x54 >> kernfs_fop_write_iter+0x120/0x1cc >> vfs_write+0x240/0x378 >> ksys_write+0x70/0x108 >> __arm64_sys_write+0x1c/0x28 >> invoke_syscall+0x48/0x10c >> el0_svc_common.constprop.0+0x40/0xe0 >> >> When allocating a hugetlb folio, between the folio is taken from buddy >> and prep_compound_page() is called, start_isolate_page_range() and >> do_migrate_range() is called. When do_migrate_range() scans the head >> page >> of the hugetlb folio, the compound_head field isn't set, so scans the >> tail page next. And at this time, the compound_head field of tail >> page is >> set, folio_test_large() is called by tail page, thus triggers >> VM_BUG_ON(). >> >> To fix it, get folio refcount before calling folio_test_large(). >> >> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports >> thp migration") >> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> >> --- >> mm/memory_hotplug.c | 12 +++--------- >> 1 file changed, 3 insertions(+), 9 deletions(-) >> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >> index 16cf9e17077e..f600c26ce5de 100644 >> --- a/mm/memory_hotplug.c >> +++ b/mm/memory_hotplug.c >> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long >> start_pfn, unsigned long end_pfn) >> page = pfn_to_page(pfn); >> folio = page_folio(page); >> - /* >> - * No reference or lock is held on the folio, so it might >> - * be modified concurrently (e.g. split). As such, >> - * folio_nr_pages() may read garbage. This is fine as the >> outer >> - * loop will revisit the split folio later. >> - */ >> - if (folio_test_large(folio)) >> - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >> - >> if (!folio_try_get(folio)) >> continue; >> if (unlikely(page_folio(page) != folio)) >> goto put_folio; >> + if (folio_test_large(folio)) >> + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; > > Moving that will not make it able to skip the large frozen > (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case > above. Hmmmm .. For free hugetlb, pfn is increased by 1 in each loop. This leads to skip free hugetlb slower. > > We could similarly to dumping folios, snapshot them, so we can read > stable data. extract the code in __dump_page()? But snapshot may lead to do_migrate_range() slower too.
On Tue, Mar 25, 2025 at 11:02:30AM +0800, Jinjiang Tu wrote: > extract the code in __dump_page()? But snapshot may lead to > do_migrate_range() slower too. Yes, but offline is already a slow operationg anyway, so it might be worth doing if we do not want to step into this "middle-operations".
On 25.03.25 04:02, Jinjiang Tu wrote: > > 在 2025/3/24 21:44, David Hildenbrand 写道: >> On 24.03.25 14:17, Jinjiang Tu wrote: >>> We triggered the below BUG: >>> >>> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 >>> pfn:0x240402 >>> head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 >>> pincount:0 >>> flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) >>> page_type: f4(hugetlb) >>> page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) >>> ------------[ cut here ]------------ >>> kernel BUG at ./include/linux/page-flags.h:310! >>> Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP >>> Modules linked in: >>> CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 >>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >>> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>> pc : const_folio_flags+0x3c/0x58 >>> lr : const_folio_flags+0x3c/0x58 >>> Call trace: >>> const_folio_flags+0x3c/0x58 (P) >>> do_migrate_range+0x164/0x720 >>> offline_pages+0x63c/0x6fc >>> memory_subsys_offline+0x190/0x1f4 >>> device_offline+0xc0/0x13c >>> state_store+0x90/0xd8 >>> dev_attr_store+0x18/0x2c >>> sysfs_kf_write+0x44/0x54 >>> kernfs_fop_write_iter+0x120/0x1cc >>> vfs_write+0x240/0x378 >>> ksys_write+0x70/0x108 >>> __arm64_sys_write+0x1c/0x28 >>> invoke_syscall+0x48/0x10c >>> el0_svc_common.constprop.0+0x40/0xe0 >>> >>> When allocating a hugetlb folio, between the folio is taken from buddy >>> and prep_compound_page() is called, start_isolate_page_range() and >>> do_migrate_range() is called. When do_migrate_range() scans the head >>> page >>> of the hugetlb folio, the compound_head field isn't set, so scans the >>> tail page next. And at this time, the compound_head field of tail >>> page is >>> set, folio_test_large() is called by tail page, thus triggers >>> VM_BUG_ON(). >>> >>> To fix it, get folio refcount before calling folio_test_large(). >>> >>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports >>> thp migration") >>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> >>> --- >>> mm/memory_hotplug.c | 12 +++--------- >>> 1 file changed, 3 insertions(+), 9 deletions(-) >>> >>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >>> index 16cf9e17077e..f600c26ce5de 100644 >>> --- a/mm/memory_hotplug.c >>> +++ b/mm/memory_hotplug.c >>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long >>> start_pfn, unsigned long end_pfn) >>> page = pfn_to_page(pfn); >>> folio = page_folio(page); >>> - /* >>> - * No reference or lock is held on the folio, so it might >>> - * be modified concurrently (e.g. split). As such, >>> - * folio_nr_pages() may read garbage. This is fine as the >>> outer >>> - * loop will revisit the split folio later. >>> - */ >>> - if (folio_test_large(folio)) >>> - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>> - >>> if (!folio_try_get(folio)) >>> continue; >>> if (unlikely(page_folio(page) != folio)) >>> goto put_folio; >>> + if (folio_test_large(folio)) >>> + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >> >> Moving that will not make it able to skip the large frozen >> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case >> above. Hmmmm .. > For free hugetlb, pfn is increased by 1 in each loop. This leads to skip > free hugetlb slower. Yes. But now I realize that we have the same issue with free buddy pages already (folio_try_get of each individual page :( ). >> >> We could similarly to dumping folios, snapshot them, so we can read >> stable data. > extract the code in __dump_page()? But snapshot may lead to > do_migrate_range() slower too. There is a patch series on the list to do that, but it might take a while to clean that up. Ideally, we'd also jump over free buddy pages. In the future we might have better ways to do that. I don't consider this change here really important, but if all it affects is free hugetlb folios, it's not really worth it to have this code around. Acked-by: David Hildenbrand <david@redhat.com> But: I suspect 8135d8926c08 is not the introducing commit. Please re-verify. We should not CC stable.
在 2025/3/26 3:05, David Hildenbrand 写道: > On 25.03.25 04:02, Jinjiang Tu wrote: >> >> 在 2025/3/24 21:44, David Hildenbrand 写道: >>> On 24.03.25 14:17, Jinjiang Tu wrote: >>>> We triggered the below BUG: >>>> >>>> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 >>>> pfn:0x240402 >>>> head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 >>>> pincount:0 >>>> flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) >>>> page_type: f4(hugetlb) >>>> page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) >>>> ------------[ cut here ]------------ >>>> kernel BUG at ./include/linux/page-flags.h:310! >>>> Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP >>>> Modules linked in: >>>> CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 >>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >>>> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>> pc : const_folio_flags+0x3c/0x58 >>>> lr : const_folio_flags+0x3c/0x58 >>>> Call trace: >>>> const_folio_flags+0x3c/0x58 (P) >>>> do_migrate_range+0x164/0x720 >>>> offline_pages+0x63c/0x6fc >>>> memory_subsys_offline+0x190/0x1f4 >>>> device_offline+0xc0/0x13c >>>> state_store+0x90/0xd8 >>>> dev_attr_store+0x18/0x2c >>>> sysfs_kf_write+0x44/0x54 >>>> kernfs_fop_write_iter+0x120/0x1cc >>>> vfs_write+0x240/0x378 >>>> ksys_write+0x70/0x108 >>>> __arm64_sys_write+0x1c/0x28 >>>> invoke_syscall+0x48/0x10c >>>> el0_svc_common.constprop.0+0x40/0xe0 >>>> >>>> When allocating a hugetlb folio, between the folio is taken from buddy >>>> and prep_compound_page() is called, start_isolate_page_range() and >>>> do_migrate_range() is called. When do_migrate_range() scans the head >>>> page >>>> of the hugetlb folio, the compound_head field isn't set, so scans the >>>> tail page next. And at this time, the compound_head field of tail >>>> page is >>>> set, folio_test_large() is called by tail page, thus triggers >>>> VM_BUG_ON(). >>>> >>>> To fix it, get folio refcount before calling folio_test_large(). >>>> >>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports >>>> thp migration") >>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> >>>> --- >>>> mm/memory_hotplug.c | 12 +++--------- >>>> 1 file changed, 3 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >>>> index 16cf9e17077e..f600c26ce5de 100644 >>>> --- a/mm/memory_hotplug.c >>>> +++ b/mm/memory_hotplug.c >>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long >>>> start_pfn, unsigned long end_pfn) >>>> page = pfn_to_page(pfn); >>>> folio = page_folio(page); >>>> - /* >>>> - * No reference or lock is held on the folio, so it might >>>> - * be modified concurrently (e.g. split). As such, >>>> - * folio_nr_pages() may read garbage. This is fine as the >>>> outer >>>> - * loop will revisit the split folio later. >>>> - */ >>>> - if (folio_test_large(folio)) >>>> - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>>> - >>>> if (!folio_try_get(folio)) >>>> continue; >>>> if (unlikely(page_folio(page) != folio)) >>>> goto put_folio; >>>> + if (folio_test_large(folio)) >>>> + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>> >>> Moving that will not make it able to skip the large frozen >>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case >>> above. Hmmmm .. >> For free hugetlb, pfn is increased by 1 in each loop. This leads to skip >> free hugetlb slower. > > Yes. But now I realize that we have the same issue with free buddy > pages already (folio_try_get of each individual page :( ). > >>> >>> We could similarly to dumping folios, snapshot them, so we can read >>> stable data. >> extract the code in __dump_page()? But snapshot may lead to >> do_migrate_range() slower too. > > There is a patch series on the list to do that, but it might take a > while to clean that up. Ideally, we'd also jump over free buddy pages. > In the future we might have better ways to do that. > > I don't consider this change here really important, but if all it > affects is free hugetlb folios, it's not really worth it to have this > code around. > > Acked-by: David Hildenbrand <david@redhat.com> > > But: I suspect 8135d8926c08 is not the introducing commit. Please > re-verify. commit 8135d8926c08 adds PageTransHuge() call, which may lead to VM_BUG_ON if the page is tail page. Before it, for hugetlb, PageHuge()、compound_head() and compound_order() are called before getting page ref, PageHuge()、compound_head() allows to pass a tail page, compound_order() will not trigger trigger VM_BUG_ON for tail page, and only lead to reading garbage data, leading to fail offline, but it will not trigger any VM_BUG_ON. So I think 8135d8926c08 is the introducing commit. Thanks. > > We should not CC stable. >
On 26.03.25 03:40, Jinjiang Tu wrote: > > 在 2025/3/26 3:05, David Hildenbrand 写道: >> On 25.03.25 04:02, Jinjiang Tu wrote: >>> >>> 在 2025/3/24 21:44, David Hildenbrand 写道: >>>> On 24.03.25 14:17, Jinjiang Tu wrote: >>>>> We triggered the below BUG: >>>>> >>>>> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 >>>>> pfn:0x240402 >>>>> head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 >>>>> pincount:0 >>>>> flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) >>>>> page_type: f4(hugetlb) >>>>> page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) >>>>> ------------[ cut here ]------------ >>>>> kernel BUG at ./include/linux/page-flags.h:310! >>>>> Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP >>>>> Modules linked in: >>>>> CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 >>>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >>>>> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>> pc : const_folio_flags+0x3c/0x58 >>>>> lr : const_folio_flags+0x3c/0x58 >>>>> Call trace: >>>>> const_folio_flags+0x3c/0x58 (P) >>>>> do_migrate_range+0x164/0x720 >>>>> offline_pages+0x63c/0x6fc >>>>> memory_subsys_offline+0x190/0x1f4 >>>>> device_offline+0xc0/0x13c >>>>> state_store+0x90/0xd8 >>>>> dev_attr_store+0x18/0x2c >>>>> sysfs_kf_write+0x44/0x54 >>>>> kernfs_fop_write_iter+0x120/0x1cc >>>>> vfs_write+0x240/0x378 >>>>> ksys_write+0x70/0x108 >>>>> __arm64_sys_write+0x1c/0x28 >>>>> invoke_syscall+0x48/0x10c >>>>> el0_svc_common.constprop.0+0x40/0xe0 >>>>> >>>>> When allocating a hugetlb folio, between the folio is taken from buddy >>>>> and prep_compound_page() is called, start_isolate_page_range() and >>>>> do_migrate_range() is called. When do_migrate_range() scans the head >>>>> page >>>>> of the hugetlb folio, the compound_head field isn't set, so scans the >>>>> tail page next. And at this time, the compound_head field of tail >>>>> page is >>>>> set, folio_test_large() is called by tail page, thus triggers >>>>> VM_BUG_ON(). >>>>> >>>>> To fix it, get folio refcount before calling folio_test_large(). >>>>> >>>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports >>>>> thp migration") >>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> >>>>> --- >>>>> mm/memory_hotplug.c | 12 +++--------- >>>>> 1 file changed, 3 insertions(+), 9 deletions(-) >>>>> >>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >>>>> index 16cf9e17077e..f600c26ce5de 100644 >>>>> --- a/mm/memory_hotplug.c >>>>> +++ b/mm/memory_hotplug.c >>>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long >>>>> start_pfn, unsigned long end_pfn) >>>>> page = pfn_to_page(pfn); >>>>> folio = page_folio(page); >>>>> - /* >>>>> - * No reference or lock is held on the folio, so it might >>>>> - * be modified concurrently (e.g. split). As such, >>>>> - * folio_nr_pages() may read garbage. This is fine as the >>>>> outer >>>>> - * loop will revisit the split folio later. >>>>> - */ >>>>> - if (folio_test_large(folio)) >>>>> - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>>>> - >>>>> if (!folio_try_get(folio)) >>>>> continue; >>>>> if (unlikely(page_folio(page) != folio)) >>>>> goto put_folio; >>>>> + if (folio_test_large(folio)) >>>>> + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>>> >>>> Moving that will not make it able to skip the large frozen >>>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case >>>> above. Hmmmm .. >>> For free hugetlb, pfn is increased by 1 in each loop. This leads to skip >>> free hugetlb slower. >> >> Yes. But now I realize that we have the same issue with free buddy >> pages already (folio_try_get of each individual page :( ). >> >>>> >>>> We could similarly to dumping folios, snapshot them, so we can read >>>> stable data. >>> extract the code in __dump_page()? But snapshot may lead to >>> do_migrate_range() slower too. >> >> There is a patch series on the list to do that, but it might take a >> while to clean that up. Ideally, we'd also jump over free buddy pages. >> In the future we might have better ways to do that. >> >> I don't consider this change here really important, but if all it >> affects is free hugetlb folios, it's not really worth it to have this >> code around. >> >> Acked-by: David Hildenbrand <david@redhat.com> >> >> But: I suspect 8135d8926c08 is not the introducing commit. Please >> re-verify. > > commit 8135d8926c08 adds PageTransHuge() call, which may lead to VM_BUG_ON if the page is tail page. Right; we check for PageHuge before that. > Before it, for hugetlb, PageHuge()、compound_head() and compound_order() are called before getting > page ref, PageHuge()、compound_head() allows to pass a tail page, compound_order() will not trigger > trigger VM_BUG_ON for tail page, and only lead to reading garbage data, leading to fail offline, but We will retry as documented, so offlining should not fail. > it will not trigger any VM_BUG_ON. So I think 8135d8926c08 is the introducing commit. Not for the hugetlb issue you describe I assume. That should be caused by commit b62b51d2d1593e6590669e781768c3c5a7394eb5 Author: Kefeng Wang <wangkefeng.wang@huawei.com> Date: Tue Aug 27 19:47:24 2024 +0800 mm: memory_hotplug: remove head variable in do_migrate_range() Patch series "mm: memory_hotplug: improve do_migrate_range()", v3. So probably we should have both commits as Fixes (one for the hugetlb path, one for the THP path)
在 2025/3/26 20:53, David Hildenbrand 写道: > On 26.03.25 03:40, Jinjiang Tu wrote: >> >> 在 2025/3/26 3:05, David Hildenbrand 写道: >>> On 25.03.25 04:02, Jinjiang Tu wrote: >>>> >>>> 在 2025/3/24 21:44, David Hildenbrand 写道: >>>>> On 24.03.25 14:17, Jinjiang Tu wrote: >>>>>> We triggered the below BUG: >>>>>> >>>>>> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 >>>>>> pfn:0x240402 >>>>>> head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 >>>>>> pincount:0 >>>>>> flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) >>>>>> page_type: f4(hugetlb) >>>>>> page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) >>>>>> ------------[ cut here ]------------ >>>>>> kernel BUG at ./include/linux/page-flags.h:310! >>>>>> Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP >>>>>> Modules linked in: >>>>>> CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty >>>>>> #374 >>>>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 >>>>>> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>>> pc : const_folio_flags+0x3c/0x58 >>>>>> lr : const_folio_flags+0x3c/0x58 >>>>>> Call trace: >>>>>> const_folio_flags+0x3c/0x58 (P) >>>>>> do_migrate_range+0x164/0x720 >>>>>> offline_pages+0x63c/0x6fc >>>>>> memory_subsys_offline+0x190/0x1f4 >>>>>> device_offline+0xc0/0x13c >>>>>> state_store+0x90/0xd8 >>>>>> dev_attr_store+0x18/0x2c >>>>>> sysfs_kf_write+0x44/0x54 >>>>>> kernfs_fop_write_iter+0x120/0x1cc >>>>>> vfs_write+0x240/0x378 >>>>>> ksys_write+0x70/0x108 >>>>>> __arm64_sys_write+0x1c/0x28 >>>>>> invoke_syscall+0x48/0x10c >>>>>> el0_svc_common.constprop.0+0x40/0xe0 >>>>>> >>>>>> When allocating a hugetlb folio, between the folio is taken from >>>>>> buddy >>>>>> and prep_compound_page() is called, start_isolate_page_range() and >>>>>> do_migrate_range() is called. When do_migrate_range() scans the head >>>>>> page >>>>>> of the hugetlb folio, the compound_head field isn't set, so scans >>>>>> the >>>>>> tail page next. And at this time, the compound_head field of tail >>>>>> page is >>>>>> set, folio_test_large() is called by tail page, thus triggers >>>>>> VM_BUG_ON(). >>>>>> >>>>>> To fix it, get folio refcount before calling folio_test_large(). >>>>>> >>>>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports >>>>>> thp migration") >>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> >>>>>> --- >>>>>> mm/memory_hotplug.c | 12 +++--------- >>>>>> 1 file changed, 3 insertions(+), 9 deletions(-) >>>>>> >>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >>>>>> index 16cf9e17077e..f600c26ce5de 100644 >>>>>> --- a/mm/memory_hotplug.c >>>>>> +++ b/mm/memory_hotplug.c >>>>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long >>>>>> start_pfn, unsigned long end_pfn) >>>>>> page = pfn_to_page(pfn); >>>>>> folio = page_folio(page); >>>>>> - /* >>>>>> - * No reference or lock is held on the folio, so it might >>>>>> - * be modified concurrently (e.g. split). As such, >>>>>> - * folio_nr_pages() may read garbage. This is fine as the >>>>>> outer >>>>>> - * loop will revisit the split folio later. >>>>>> - */ >>>>>> - if (folio_test_large(folio)) >>>>>> - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>>>>> - >>>>>> if (!folio_try_get(folio)) >>>>>> continue; >>>>>> if (unlikely(page_folio(page) != folio)) >>>>>> goto put_folio; >>>>>> + if (folio_test_large(folio)) >>>>>> + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; >>>>> >>>>> Moving that will not make it able to skip the large frozen >>>>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio >>>>> case >>>>> above. Hmmmm .. >>>> For free hugetlb, pfn is increased by 1 in each loop. This leads to >>>> skip >>>> free hugetlb slower. >>> >>> Yes. But now I realize that we have the same issue with free buddy >>> pages already (folio_try_get of each individual page :( ). >>> >>>>> >>>>> We could similarly to dumping folios, snapshot them, so we can read >>>>> stable data. >>>> extract the code in __dump_page()? But snapshot may lead to >>>> do_migrate_range() slower too. >>> >>> There is a patch series on the list to do that, but it might take a >>> while to clean that up. Ideally, we'd also jump over free buddy pages. >>> In the future we might have better ways to do that. >>> >>> I don't consider this change here really important, but if all it >>> affects is free hugetlb folios, it's not really worth it to have this >>> code around. >>> >>> Acked-by: David Hildenbrand <david@redhat.com> >>> >>> But: I suspect 8135d8926c08 is not the introducing commit. Please >>> re-verify. >> >> commit 8135d8926c08 adds PageTransHuge() call, which may lead to >> VM_BUG_ON if the page is tail page. > > Right; we check for PageHuge before that. > >> Before it, for hugetlb, PageHuge()、compound_head() and >> compound_order() are called before getting >> page ref, PageHuge()、compound_head() allows to pass a tail page, >> compound_order() will not trigger >> trigger VM_BUG_ON for tail page, and only lead to reading garbage >> data, leading to fail offline, but > > We will retry as documented, so offlining should not fail. > >> it will not trigger any VM_BUG_ON. So I think 8135d8926c08 is the >> introducing commit. > > Not for the hugetlb issue you describe I assume. That should be caused by > > commit b62b51d2d1593e6590669e781768c3c5a7394eb5 > Author: Kefeng Wang <wangkefeng.wang@huawei.com> > Date: Tue Aug 27 19:47:24 2024 +0800 > > mm: memory_hotplug: remove head variable in do_migrate_range() > > Patch series "mm: memory_hotplug: improve do_migrate_range()", v3. > > So probably we should have both commits as Fixes (one for the hugetlb > path, one for the THP path) > Yes, b62b51d2d159 ("mm: memory_hotplug: remove head variable in do_migrate_range()") introduced this for hugetlb. Thanks.
On Mon, Mar 24, 2025 at 09:17:50PM +0800, Jinjiang Tu wrote: > We triggered the below BUG: > ... > When allocating a hugetlb folio, between the folio is taken from buddy > and prep_compound_page() is called, start_isolate_page_range() and > do_migrate_range() is called. When do_migrate_range() scans the head page > of the hugetlb folio, the compound_head field isn't set, so scans the > tail page next. And at this time, the compound_head field of tail page is > set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON(). > > To fix it, get folio refcount before calling folio_test_large(). > > Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") > Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> As David mentions, I think too that this was introduced in commit: b62b51d2d1593e65 ("mm: memory_hotplug: remove head variable in do_migrate_range()")
On Tue, 25 Mar 2025 20:05:50 +0100 David Hildenbrand <david@redhat.com> wrote: > On 25.03.25 04:02, Jinjiang Tu wrote: > > > >> Moving that will not make it able to skip the large frozen > >> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case > >> above. Hmmmm .. > > For free hugetlb, pfn is increased by 1 in each loop. This leads to skip > > free hugetlb slower. > > Yes. But now I realize that we have the same issue with free buddy pages > already (folio_try_get of each individual page :( ). > > >> > >> We could similarly to dumping folios, snapshot them, so we can read > >> stable data. > > extract the code in __dump_page()? But snapshot may lead to > > do_migrate_range() slower too. > > There is a patch series on the list to do that, but it might take a > while to clean that up. Ideally, we'd also jump over free buddy pages. > In the future we might have better ways to do that. > > I don't consider this change here really important, but if all it > affects is free hugetlb folios, it's not really worth it to have this > code around. > > Acked-by: David Hildenbrand <david@redhat.com> > > But: I suspect 8135d8926c08 is not the introducing commit. Please re-verify. > > We should not CC stable. Why shouldn't we cc:stable? A userspace-triggerable BUG?
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 16cf9e17077e..f600c26ce5de 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) page = pfn_to_page(pfn); folio = page_folio(page); - /* - * No reference or lock is held on the folio, so it might - * be modified concurrently (e.g. split). As such, - * folio_nr_pages() may read garbage. This is fine as the outer - * loop will revisit the split folio later. - */ - if (folio_test_large(folio)) - pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; - if (!folio_try_get(folio)) continue; if (unlikely(page_folio(page) != folio)) goto put_folio; + if (folio_test_large(folio)) + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; + if (folio_test_hwpoison(folio) || (folio_test_large(folio) && folio_test_has_hwpoisoned(folio))) { if (WARN_ON(folio_test_lru(folio)))
We triggered the below BUG: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402 head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) page_type: f4(hugetlb) page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) ------------[ cut here ]------------ kernel BUG at ./include/linux/page-flags.h:310! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : const_folio_flags+0x3c/0x58 lr : const_folio_flags+0x3c/0x58 Call trace: const_folio_flags+0x3c/0x58 (P) do_migrate_range+0x164/0x720 offline_pages+0x63c/0x6fc memory_subsys_offline+0x190/0x1f4 device_offline+0xc0/0x13c state_store+0x90/0xd8 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x120/0x1cc vfs_write+0x240/0x378 ksys_write+0x70/0x108 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x48/0x10c el0_svc_common.constprop.0+0x40/0xe0 When allocating a hugetlb folio, between the folio is taken from buddy and prep_compound_page() is called, start_isolate_page_range() and do_migrate_range() is called. When do_migrate_range() scans the head page of the hugetlb folio, the compound_head field isn't set, so scans the tail page next. And at this time, the compound_head field of tail page is set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON(). To fix it, get folio refcount before calling folio_test_large(). Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> --- mm/memory_hotplug.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-)