diff mbox series

mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range

Message ID 20250324131750.1551884-1-tujinjiang@huawei.com (mailing list archive)
State New
Headers show
Series mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range | expand

Commit Message

Jinjiang Tu March 24, 2025, 1:17 p.m. UTC
We triggered the below BUG:

 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402
 head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
 page_type: f4(hugetlb)
 page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
 ------------[ cut here ]------------
 kernel BUG at ./include/linux/page-flags.h:310!
 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
 Modules linked in:
 CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374
 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : const_folio_flags+0x3c/0x58
 lr : const_folio_flags+0x3c/0x58
 Call trace:
  const_folio_flags+0x3c/0x58 (P)
  do_migrate_range+0x164/0x720
  offline_pages+0x63c/0x6fc
  memory_subsys_offline+0x190/0x1f4
  device_offline+0xc0/0x13c
  state_store+0x90/0xd8
  dev_attr_store+0x18/0x2c
  sysfs_kf_write+0x44/0x54
  kernfs_fop_write_iter+0x120/0x1cc
  vfs_write+0x240/0x378
  ksys_write+0x70/0x108
  __arm64_sys_write+0x1c/0x28
  invoke_syscall+0x48/0x10c
  el0_svc_common.constprop.0+0x40/0xe0

When allocating a hugetlb folio, between the folio is taken from buddy
and prep_compound_page() is called, start_isolate_page_range() and
do_migrate_range() is called. When do_migrate_range() scans the head page
of the hugetlb folio, the compound_head field isn't set, so scans the
tail page next. And at this time, the compound_head field of tail page is
set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON().

To fix it, get folio refcount before calling folio_test_large().

Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/memory_hotplug.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

Comments

David Hildenbrand March 24, 2025, 1:44 p.m. UTC | #1
On 24.03.25 14:17, Jinjiang Tu wrote:
> We triggered the below BUG:
> 
>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402
>   head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>   flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
>   page_type: f4(hugetlb)
>   page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
>   ------------[ cut here ]------------
>   kernel BUG at ./include/linux/page-flags.h:310!
>   Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>   Modules linked in:
>   CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374
>   Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>   pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>   pc : const_folio_flags+0x3c/0x58
>   lr : const_folio_flags+0x3c/0x58
>   Call trace:
>    const_folio_flags+0x3c/0x58 (P)
>    do_migrate_range+0x164/0x720
>    offline_pages+0x63c/0x6fc
>    memory_subsys_offline+0x190/0x1f4
>    device_offline+0xc0/0x13c
>    state_store+0x90/0xd8
>    dev_attr_store+0x18/0x2c
>    sysfs_kf_write+0x44/0x54
>    kernfs_fop_write_iter+0x120/0x1cc
>    vfs_write+0x240/0x378
>    ksys_write+0x70/0x108
>    __arm64_sys_write+0x1c/0x28
>    invoke_syscall+0x48/0x10c
>    el0_svc_common.constprop.0+0x40/0xe0
> 
> When allocating a hugetlb folio, between the folio is taken from buddy
> and prep_compound_page() is called, start_isolate_page_range() and
> do_migrate_range() is called. When do_migrate_range() scans the head page
> of the hugetlb folio, the compound_head field isn't set, so scans the
> tail page next. And at this time, the compound_head field of tail page is
> set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON().
> 
> To fix it, get folio refcount before calling folio_test_large().
> 
> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
>   mm/memory_hotplug.c | 12 +++---------
>   1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 16cf9e17077e..f600c26ce5de 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>   		page = pfn_to_page(pfn);
>   		folio = page_folio(page);
>   
> -		/*
> -		 * No reference or lock is held on the folio, so it might
> -		 * be modified concurrently (e.g. split).  As such,
> -		 * folio_nr_pages() may read garbage.  This is fine as the outer
> -		 * loop will revisit the split folio later.
> -		 */
> -		if (folio_test_large(folio))
> -			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
> -
>   		if (!folio_try_get(folio))
>   			continue;
>   
>   		if (unlikely(page_folio(page) != folio))
>   			goto put_folio;
>   
> +		if (folio_test_large(folio))
> +			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;

Moving that will not make it able to skip the large frozen (refcount==0, 
e.g., free hugetlb) folio in the continue/put_folio case above. Hmmmm ..

We could similarly to dumping folios, snapshot them, so we can read 
stable data.
Jinjiang Tu March 25, 2025, 3:02 a.m. UTC | #2
在 2025/3/24 21:44, David Hildenbrand 写道:
> On 24.03.25 14:17, Jinjiang Tu wrote:
>> We triggered the below BUG:
>>
>>   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 
>> pfn:0x240402
>>   head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 
>> pincount:0
>>   flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
>>   page_type: f4(hugetlb)
>>   page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
>>   ------------[ cut here ]------------
>>   kernel BUG at ./include/linux/page-flags.h:310!
>>   Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>   Modules linked in:
>>   CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374
>>   Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>   pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>   pc : const_folio_flags+0x3c/0x58
>>   lr : const_folio_flags+0x3c/0x58
>>   Call trace:
>>    const_folio_flags+0x3c/0x58 (P)
>>    do_migrate_range+0x164/0x720
>>    offline_pages+0x63c/0x6fc
>>    memory_subsys_offline+0x190/0x1f4
>>    device_offline+0xc0/0x13c
>>    state_store+0x90/0xd8
>>    dev_attr_store+0x18/0x2c
>>    sysfs_kf_write+0x44/0x54
>>    kernfs_fop_write_iter+0x120/0x1cc
>>    vfs_write+0x240/0x378
>>    ksys_write+0x70/0x108
>>    __arm64_sys_write+0x1c/0x28
>>    invoke_syscall+0x48/0x10c
>>    el0_svc_common.constprop.0+0x40/0xe0
>>
>> When allocating a hugetlb folio, between the folio is taken from buddy
>> and prep_compound_page() is called, start_isolate_page_range() and
>> do_migrate_range() is called. When do_migrate_range() scans the head 
>> page
>> of the hugetlb folio, the compound_head field isn't set, so scans the
>> tail page next. And at this time, the compound_head field of tail 
>> page is
>> set, folio_test_large() is called by tail page, thus triggers 
>> VM_BUG_ON().
>>
>> To fix it, get folio refcount before calling folio_test_large().
>>
>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports 
>> thp migration")
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>> ---
>>   mm/memory_hotplug.c | 12 +++---------
>>   1 file changed, 3 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 16cf9e17077e..f600c26ce5de 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long 
>> start_pfn, unsigned long end_pfn)
>>           page = pfn_to_page(pfn);
>>           folio = page_folio(page);
>>   -        /*
>> -         * No reference or lock is held on the folio, so it might
>> -         * be modified concurrently (e.g. split).  As such,
>> -         * folio_nr_pages() may read garbage.  This is fine as the 
>> outer
>> -         * loop will revisit the split folio later.
>> -         */
>> -        if (folio_test_large(folio))
>> -            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>> -
>>           if (!folio_try_get(folio))
>>               continue;
>>             if (unlikely(page_folio(page) != folio))
>>               goto put_folio;
>>   +        if (folio_test_large(folio))
>> +            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>
> Moving that will not make it able to skip the large frozen 
> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case 
> above. Hmmmm ..
For free hugetlb, pfn is increased by 1 in each loop. This leads to skip 
free hugetlb slower.
>
> We could similarly to dumping folios, snapshot them, so we can read 
> stable data.
extract the code in __dump_page()? But snapshot may lead to 
do_migrate_range() slower too.
Oscar Salvador March 25, 2025, 1:18 p.m. UTC | #3
On Tue, Mar 25, 2025 at 11:02:30AM +0800, Jinjiang Tu wrote:
> extract the code in __dump_page()? But snapshot may lead to
> do_migrate_range() slower too.

Yes, but offline is already a slow operationg anyway, so it might be
worth doing if we do not want to step into this "middle-operations".
David Hildenbrand March 25, 2025, 7:05 p.m. UTC | #4
On 25.03.25 04:02, Jinjiang Tu wrote:
> 
> 在 2025/3/24 21:44, David Hildenbrand 写道:
>> On 24.03.25 14:17, Jinjiang Tu wrote:
>>> We triggered the below BUG:
>>>
>>>    page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2
>>> pfn:0x240402
>>>    head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0
>>> pincount:0
>>>    flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
>>>    page_type: f4(hugetlb)
>>>    page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
>>>    ------------[ cut here ]------------
>>>    kernel BUG at ./include/linux/page-flags.h:310!
>>>    Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>    Modules linked in:
>>>    CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374
>>>    Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>>    pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>    pc : const_folio_flags+0x3c/0x58
>>>    lr : const_folio_flags+0x3c/0x58
>>>    Call trace:
>>>     const_folio_flags+0x3c/0x58 (P)
>>>     do_migrate_range+0x164/0x720
>>>     offline_pages+0x63c/0x6fc
>>>     memory_subsys_offline+0x190/0x1f4
>>>     device_offline+0xc0/0x13c
>>>     state_store+0x90/0xd8
>>>     dev_attr_store+0x18/0x2c
>>>     sysfs_kf_write+0x44/0x54
>>>     kernfs_fop_write_iter+0x120/0x1cc
>>>     vfs_write+0x240/0x378
>>>     ksys_write+0x70/0x108
>>>     __arm64_sys_write+0x1c/0x28
>>>     invoke_syscall+0x48/0x10c
>>>     el0_svc_common.constprop.0+0x40/0xe0
>>>
>>> When allocating a hugetlb folio, between the folio is taken from buddy
>>> and prep_compound_page() is called, start_isolate_page_range() and
>>> do_migrate_range() is called. When do_migrate_range() scans the head
>>> page
>>> of the hugetlb folio, the compound_head field isn't set, so scans the
>>> tail page next. And at this time, the compound_head field of tail
>>> page is
>>> set, folio_test_large() is called by tail page, thus triggers
>>> VM_BUG_ON().
>>>
>>> To fix it, get folio refcount before calling folio_test_large().
>>>
>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports
>>> thp migration")
>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>> ---
>>>    mm/memory_hotplug.c | 12 +++---------
>>>    1 file changed, 3 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 16cf9e17077e..f600c26ce5de 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long
>>> start_pfn, unsigned long end_pfn)
>>>            page = pfn_to_page(pfn);
>>>            folio = page_folio(page);
>>>    -        /*
>>> -         * No reference or lock is held on the folio, so it might
>>> -         * be modified concurrently (e.g. split).  As such,
>>> -         * folio_nr_pages() may read garbage.  This is fine as the
>>> outer
>>> -         * loop will revisit the split folio later.
>>> -         */
>>> -        if (folio_test_large(folio))
>>> -            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>> -
>>>            if (!folio_try_get(folio))
>>>                continue;
>>>              if (unlikely(page_folio(page) != folio))
>>>                goto put_folio;
>>>    +        if (folio_test_large(folio))
>>> +            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>
>> Moving that will not make it able to skip the large frozen
>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case
>> above. Hmmmm ..
> For free hugetlb, pfn is increased by 1 in each loop. This leads to skip
> free hugetlb slower.

Yes. But now I realize that we have the same issue with free buddy pages 
already (folio_try_get of each individual page :( ).

>>
>> We could similarly to dumping folios, snapshot them, so we can read
>> stable data.
> extract the code in __dump_page()? But snapshot may lead to
> do_migrate_range() slower too.

There is a patch series on the list to do that, but it might take a 
while to clean that up. Ideally, we'd also jump over free buddy pages. 
In the future we might have better ways to do that.

I don't consider this change here really important, but if all it 
affects is free hugetlb folios, it's not really worth it to have this 
code around.

Acked-by: David Hildenbrand <david@redhat.com>

But: I suspect 8135d8926c08 is not the introducing commit. Please re-verify.

We should not CC stable.
Jinjiang Tu March 26, 2025, 2:40 a.m. UTC | #5
在 2025/3/26 3:05, David Hildenbrand 写道:
> On 25.03.25 04:02, Jinjiang Tu wrote:
>>
>> 在 2025/3/24 21:44, David Hildenbrand 写道:
>>> On 24.03.25 14:17, Jinjiang Tu wrote:
>>>> We triggered the below BUG:
>>>>
>>>>    page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2
>>>> pfn:0x240402
>>>>    head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0
>>>> pincount:0
>>>>    flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
>>>>    page_type: f4(hugetlb)
>>>>    page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
>>>>    ------------[ cut here ]------------
>>>>    kernel BUG at ./include/linux/page-flags.h:310!
>>>>    Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>    Modules linked in:
>>>>    CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374
>>>>    Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>>>    pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>    pc : const_folio_flags+0x3c/0x58
>>>>    lr : const_folio_flags+0x3c/0x58
>>>>    Call trace:
>>>>     const_folio_flags+0x3c/0x58 (P)
>>>>     do_migrate_range+0x164/0x720
>>>>     offline_pages+0x63c/0x6fc
>>>>     memory_subsys_offline+0x190/0x1f4
>>>>     device_offline+0xc0/0x13c
>>>>     state_store+0x90/0xd8
>>>>     dev_attr_store+0x18/0x2c
>>>>     sysfs_kf_write+0x44/0x54
>>>>     kernfs_fop_write_iter+0x120/0x1cc
>>>>     vfs_write+0x240/0x378
>>>>     ksys_write+0x70/0x108
>>>>     __arm64_sys_write+0x1c/0x28
>>>>     invoke_syscall+0x48/0x10c
>>>>     el0_svc_common.constprop.0+0x40/0xe0
>>>>
>>>> When allocating a hugetlb folio, between the folio is taken from buddy
>>>> and prep_compound_page() is called, start_isolate_page_range() and
>>>> do_migrate_range() is called. When do_migrate_range() scans the head
>>>> page
>>>> of the hugetlb folio, the compound_head field isn't set, so scans the
>>>> tail page next. And at this time, the compound_head field of tail
>>>> page is
>>>> set, folio_test_large() is called by tail page, thus triggers
>>>> VM_BUG_ON().
>>>>
>>>> To fix it, get folio refcount before calling folio_test_large().
>>>>
>>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports
>>>> thp migration")
>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>> ---
>>>>    mm/memory_hotplug.c | 12 +++---------
>>>>    1 file changed, 3 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>> index 16cf9e17077e..f600c26ce5de 100644
>>>> --- a/mm/memory_hotplug.c
>>>> +++ b/mm/memory_hotplug.c
>>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long
>>>> start_pfn, unsigned long end_pfn)
>>>>            page = pfn_to_page(pfn);
>>>>            folio = page_folio(page);
>>>>    -        /*
>>>> -         * No reference or lock is held on the folio, so it might
>>>> -         * be modified concurrently (e.g. split).  As such,
>>>> -         * folio_nr_pages() may read garbage.  This is fine as the
>>>> outer
>>>> -         * loop will revisit the split folio later.
>>>> -         */
>>>> -        if (folio_test_large(folio))
>>>> -            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>>> -
>>>>            if (!folio_try_get(folio))
>>>>                continue;
>>>>              if (unlikely(page_folio(page) != folio))
>>>>                goto put_folio;
>>>>    +        if (folio_test_large(folio))
>>>> +            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>>
>>> Moving that will not make it able to skip the large frozen
>>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case
>>> above. Hmmmm ..
>> For free hugetlb, pfn is increased by 1 in each loop. This leads to skip
>> free hugetlb slower.
>
> Yes. But now I realize that we have the same issue with free buddy 
> pages already (folio_try_get of each individual page :( ).
>
>>>
>>> We could similarly to dumping folios, snapshot them, so we can read
>>> stable data.
>> extract the code in __dump_page()? But snapshot may lead to
>> do_migrate_range() slower too.
>
> There is a patch series on the list to do that, but it might take a 
> while to clean that up. Ideally, we'd also jump over free buddy pages. 
> In the future we might have better ways to do that.
>
> I don't consider this change here really important, but if all it 
> affects is free hugetlb folios, it's not really worth it to have this 
> code around.
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> But: I suspect 8135d8926c08 is not the introducing commit. Please 
> re-verify.

commit 8135d8926c08 adds PageTransHuge() call, which may lead to VM_BUG_ON if the page is tail page.
Before it, for hugetlb, PageHuge()、compound_head() and compound_order() are called before getting
page ref,  PageHuge()、compound_head() allows to pass a tail page, compound_order() will not trigger
trigger VM_BUG_ON for tail page, and only lead to reading garbage data, leading to fail offline, but
it will not trigger any VM_BUG_ON. So I think 8135d8926c08 is the introducing commit.

Thanks.

>
> We should not CC stable.
>
David Hildenbrand March 26, 2025, 12:53 p.m. UTC | #6
On 26.03.25 03:40, Jinjiang Tu wrote:
> 
> 在 2025/3/26 3:05, David Hildenbrand 写道:
>> On 25.03.25 04:02, Jinjiang Tu wrote:
>>>
>>> 在 2025/3/24 21:44, David Hildenbrand 写道:
>>>> On 24.03.25 14:17, Jinjiang Tu wrote:
>>>>> We triggered the below BUG:
>>>>>
>>>>>     page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2
>>>>> pfn:0x240402
>>>>>     head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0
>>>>> pincount:0
>>>>>     flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
>>>>>     page_type: f4(hugetlb)
>>>>>     page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
>>>>>     ------------[ cut here ]------------
>>>>>     kernel BUG at ./include/linux/page-flags.h:310!
>>>>>     Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>>     Modules linked in:
>>>>>     CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374
>>>>>     Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>>>>     pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>     pc : const_folio_flags+0x3c/0x58
>>>>>     lr : const_folio_flags+0x3c/0x58
>>>>>     Call trace:
>>>>>      const_folio_flags+0x3c/0x58 (P)
>>>>>      do_migrate_range+0x164/0x720
>>>>>      offline_pages+0x63c/0x6fc
>>>>>      memory_subsys_offline+0x190/0x1f4
>>>>>      device_offline+0xc0/0x13c
>>>>>      state_store+0x90/0xd8
>>>>>      dev_attr_store+0x18/0x2c
>>>>>      sysfs_kf_write+0x44/0x54
>>>>>      kernfs_fop_write_iter+0x120/0x1cc
>>>>>      vfs_write+0x240/0x378
>>>>>      ksys_write+0x70/0x108
>>>>>      __arm64_sys_write+0x1c/0x28
>>>>>      invoke_syscall+0x48/0x10c
>>>>>      el0_svc_common.constprop.0+0x40/0xe0
>>>>>
>>>>> When allocating a hugetlb folio, between the folio is taken from buddy
>>>>> and prep_compound_page() is called, start_isolate_page_range() and
>>>>> do_migrate_range() is called. When do_migrate_range() scans the head
>>>>> page
>>>>> of the hugetlb folio, the compound_head field isn't set, so scans the
>>>>> tail page next. And at this time, the compound_head field of tail
>>>>> page is
>>>>> set, folio_test_large() is called by tail page, thus triggers
>>>>> VM_BUG_ON().
>>>>>
>>>>> To fix it, get folio refcount before calling folio_test_large().
>>>>>
>>>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports
>>>>> thp migration")
>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>> ---
>>>>>     mm/memory_hotplug.c | 12 +++---------
>>>>>     1 file changed, 3 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>> index 16cf9e17077e..f600c26ce5de 100644
>>>>> --- a/mm/memory_hotplug.c
>>>>> +++ b/mm/memory_hotplug.c
>>>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long
>>>>> start_pfn, unsigned long end_pfn)
>>>>>             page = pfn_to_page(pfn);
>>>>>             folio = page_folio(page);
>>>>>     -        /*
>>>>> -         * No reference or lock is held on the folio, so it might
>>>>> -         * be modified concurrently (e.g. split).  As such,
>>>>> -         * folio_nr_pages() may read garbage.  This is fine as the
>>>>> outer
>>>>> -         * loop will revisit the split folio later.
>>>>> -         */
>>>>> -        if (folio_test_large(folio))
>>>>> -            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>>>> -
>>>>>             if (!folio_try_get(folio))
>>>>>                 continue;
>>>>>               if (unlikely(page_folio(page) != folio))
>>>>>                 goto put_folio;
>>>>>     +        if (folio_test_large(folio))
>>>>> +            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>>>
>>>> Moving that will not make it able to skip the large frozen
>>>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case
>>>> above. Hmmmm ..
>>> For free hugetlb, pfn is increased by 1 in each loop. This leads to skip
>>> free hugetlb slower.
>>
>> Yes. But now I realize that we have the same issue with free buddy
>> pages already (folio_try_get of each individual page :( ).
>>
>>>>
>>>> We could similarly to dumping folios, snapshot them, so we can read
>>>> stable data.
>>> extract the code in __dump_page()? But snapshot may lead to
>>> do_migrate_range() slower too.
>>
>> There is a patch series on the list to do that, but it might take a
>> while to clean that up. Ideally, we'd also jump over free buddy pages.
>> In the future we might have better ways to do that.
>>
>> I don't consider this change here really important, but if all it
>> affects is free hugetlb folios, it's not really worth it to have this
>> code around.
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>>
>> But: I suspect 8135d8926c08 is not the introducing commit. Please
>> re-verify.
> 
> commit 8135d8926c08 adds PageTransHuge() call, which may lead to VM_BUG_ON if the page is tail page.

Right; we check for PageHuge before that.

> Before it, for hugetlb, PageHuge()、compound_head() and compound_order() are called before getting
> page ref,  PageHuge()、compound_head() allows to pass a tail page, compound_order() will not trigger
> trigger VM_BUG_ON for tail page, and only lead to reading garbage data, leading to fail offline, but

We will retry as documented, so offlining should not fail.

> it will not trigger any VM_BUG_ON. So I think 8135d8926c08 is the introducing commit.

Not for the hugetlb issue you describe I assume. That should be caused by

commit b62b51d2d1593e6590669e781768c3c5a7394eb5
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue Aug 27 19:47:24 2024 +0800

     mm: memory_hotplug: remove head variable in do_migrate_range()

     Patch series "mm: memory_hotplug: improve do_migrate_range()", v3.

So probably we should have both commits as Fixes (one for the hugetlb 
path, one for the THP path)
Jinjiang Tu March 27, 2025, 11:19 a.m. UTC | #7
在 2025/3/26 20:53, David Hildenbrand 写道:
> On 26.03.25 03:40, Jinjiang Tu wrote:
>>
>> 在 2025/3/26 3:05, David Hildenbrand 写道:
>>> On 25.03.25 04:02, Jinjiang Tu wrote:
>>>>
>>>> 在 2025/3/24 21:44, David Hildenbrand 写道:
>>>>> On 24.03.25 14:17, Jinjiang Tu wrote:
>>>>>> We triggered the below BUG:
>>>>>>
>>>>>>     page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2
>>>>>> pfn:0x240402
>>>>>>     head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0
>>>>>> pincount:0
>>>>>>     flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff)
>>>>>>     page_type: f4(hugetlb)
>>>>>>     page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1)
>>>>>>     ------------[ cut here ]------------
>>>>>>     kernel BUG at ./include/linux/page-flags.h:310!
>>>>>>     Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>>>     Modules linked in:
>>>>>>     CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty 
>>>>>> #374
>>>>>>     Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>>>>>     pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>     pc : const_folio_flags+0x3c/0x58
>>>>>>     lr : const_folio_flags+0x3c/0x58
>>>>>>     Call trace:
>>>>>>      const_folio_flags+0x3c/0x58 (P)
>>>>>>      do_migrate_range+0x164/0x720
>>>>>>      offline_pages+0x63c/0x6fc
>>>>>>      memory_subsys_offline+0x190/0x1f4
>>>>>>      device_offline+0xc0/0x13c
>>>>>>      state_store+0x90/0xd8
>>>>>>      dev_attr_store+0x18/0x2c
>>>>>>      sysfs_kf_write+0x44/0x54
>>>>>>      kernfs_fop_write_iter+0x120/0x1cc
>>>>>>      vfs_write+0x240/0x378
>>>>>>      ksys_write+0x70/0x108
>>>>>>      __arm64_sys_write+0x1c/0x28
>>>>>>      invoke_syscall+0x48/0x10c
>>>>>>      el0_svc_common.constprop.0+0x40/0xe0
>>>>>>
>>>>>> When allocating a hugetlb folio, between the folio is taken from 
>>>>>> buddy
>>>>>> and prep_compound_page() is called, start_isolate_page_range() and
>>>>>> do_migrate_range() is called. When do_migrate_range() scans the head
>>>>>> page
>>>>>> of the hugetlb folio, the compound_head field isn't set, so scans 
>>>>>> the
>>>>>> tail page next. And at this time, the compound_head field of tail
>>>>>> page is
>>>>>> set, folio_test_large() is called by tail page, thus triggers
>>>>>> VM_BUG_ON().
>>>>>>
>>>>>> To fix it, get folio refcount before calling folio_test_large().
>>>>>>
>>>>>> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports
>>>>>> thp migration")
>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>> ---
>>>>>>     mm/memory_hotplug.c | 12 +++---------
>>>>>>     1 file changed, 3 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>>> index 16cf9e17077e..f600c26ce5de 100644
>>>>>> --- a/mm/memory_hotplug.c
>>>>>> +++ b/mm/memory_hotplug.c
>>>>>> @@ -1813,21 +1813,15 @@ static void do_migrate_range(unsigned long
>>>>>> start_pfn, unsigned long end_pfn)
>>>>>>             page = pfn_to_page(pfn);
>>>>>>             folio = page_folio(page);
>>>>>>     -        /*
>>>>>> -         * No reference or lock is held on the folio, so it might
>>>>>> -         * be modified concurrently (e.g. split).  As such,
>>>>>> -         * folio_nr_pages() may read garbage.  This is fine as the
>>>>>> outer
>>>>>> -         * loop will revisit the split folio later.
>>>>>> -         */
>>>>>> -        if (folio_test_large(folio))
>>>>>> -            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>>>>> -
>>>>>>             if (!folio_try_get(folio))
>>>>>>                 continue;
>>>>>>               if (unlikely(page_folio(page) != folio))
>>>>>>                 goto put_folio;
>>>>>>     +        if (folio_test_large(folio))
>>>>>> +            pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
>>>>>
>>>>> Moving that will not make it able to skip the large frozen
>>>>> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio 
>>>>> case
>>>>> above. Hmmmm ..
>>>> For free hugetlb, pfn is increased by 1 in each loop. This leads to 
>>>> skip
>>>> free hugetlb slower.
>>>
>>> Yes. But now I realize that we have the same issue with free buddy
>>> pages already (folio_try_get of each individual page :( ).
>>>
>>>>>
>>>>> We could similarly to dumping folios, snapshot them, so we can read
>>>>> stable data.
>>>> extract the code in __dump_page()? But snapshot may lead to
>>>> do_migrate_range() slower too.
>>>
>>> There is a patch series on the list to do that, but it might take a
>>> while to clean that up. Ideally, we'd also jump over free buddy pages.
>>> In the future we might have better ways to do that.
>>>
>>> I don't consider this change here really important, but if all it
>>> affects is free hugetlb folios, it's not really worth it to have this
>>> code around.
>>>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>
>>> But: I suspect 8135d8926c08 is not the introducing commit. Please
>>> re-verify.
>>
>> commit 8135d8926c08 adds PageTransHuge() call, which may lead to 
>> VM_BUG_ON if the page is tail page.
>
> Right; we check for PageHuge before that.
>
>> Before it, for hugetlb, PageHuge()、compound_head() and 
>> compound_order() are called before getting
>> page ref,  PageHuge()、compound_head() allows to pass a tail page, 
>> compound_order() will not trigger
>> trigger VM_BUG_ON for tail page, and only lead to reading garbage 
>> data, leading to fail offline, but
>
> We will retry as documented, so offlining should not fail.
>
>> it will not trigger any VM_BUG_ON. So I think 8135d8926c08 is the 
>> introducing commit.
>
> Not for the hugetlb issue you describe I assume. That should be caused by
>
> commit b62b51d2d1593e6590669e781768c3c5a7394eb5
> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
> Date:   Tue Aug 27 19:47:24 2024 +0800
>
>     mm: memory_hotplug: remove head variable in do_migrate_range()
>
>     Patch series "mm: memory_hotplug: improve do_migrate_range()", v3.
>
> So probably we should have both commits as Fixes (one for the hugetlb 
> path, one for the THP path)
>
Yes, b62b51d2d159 ("mm: memory_hotplug: remove head variable in do_migrate_range()") introduced this for hugetlb.

Thanks.
Oscar Salvador March 28, 2025, 1:23 p.m. UTC | #8
On Mon, Mar 24, 2025 at 09:17:50PM +0800, Jinjiang Tu wrote:
> We triggered the below BUG:
> 
... 
> When allocating a hugetlb folio, between the folio is taken from buddy
> and prep_compound_page() is called, start_isolate_page_range() and
> do_migrate_range() is called. When do_migrate_range() scans the head page
> of the hugetlb folio, the compound_head field isn't set, so scans the
> tail page next. And at this time, the compound_head field of tail page is
> set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON().
> 
> To fix it, get folio refcount before calling folio_test_large().
> 
> Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

As David mentions, I think too that this was introduced in
commit: b62b51d2d1593e65 ("mm: memory_hotplug: remove head variable in
do_migrate_range()")
Andrew Morton March 28, 2025, 11:37 p.m. UTC | #9
On Tue, 25 Mar 2025 20:05:50 +0100 David Hildenbrand <david@redhat.com> wrote:

> On 25.03.25 04:02, Jinjiang Tu wrote:
> > 
> >> Moving that will not make it able to skip the large frozen
> >> (refcount==0, e.g., free hugetlb) folio in the continue/put_folio case
> >> above. Hmmmm ..
> > For free hugetlb, pfn is increased by 1 in each loop. This leads to skip
> > free hugetlb slower.
> 
> Yes. But now I realize that we have the same issue with free buddy pages 
> already (folio_try_get of each individual page :( ).
> 
> >>
> >> We could similarly to dumping folios, snapshot them, so we can read
> >> stable data.
> > extract the code in __dump_page()? But snapshot may lead to
> > do_migrate_range() slower too.
> 
> There is a patch series on the list to do that, but it might take a 
> while to clean that up. Ideally, we'd also jump over free buddy pages. 
> In the future we might have better ways to do that.
> 
> I don't consider this change here really important, but if all it 
> affects is free hugetlb folios, it's not really worth it to have this 
> code around.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> But: I suspect 8135d8926c08 is not the introducing commit. Please re-verify.
> 
> We should not CC stable.

Why shouldn't we cc:stable?  A userspace-triggerable BUG?
diff mbox series

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 16cf9e17077e..f600c26ce5de 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1813,21 +1813,15 @@  static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		page = pfn_to_page(pfn);
 		folio = page_folio(page);
 
-		/*
-		 * No reference or lock is held on the folio, so it might
-		 * be modified concurrently (e.g. split).  As such,
-		 * folio_nr_pages() may read garbage.  This is fine as the outer
-		 * loop will revisit the split folio later.
-		 */
-		if (folio_test_large(folio))
-			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
-
 		if (!folio_try_get(folio))
 			continue;
 
 		if (unlikely(page_folio(page) != folio))
 			goto put_folio;
 
+		if (folio_test_large(folio))
+			pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;
+
 		if (folio_test_hwpoison(folio) ||
 		    (folio_test_large(folio) && folio_test_has_hwpoisoned(folio))) {
 			if (WARN_ON(folio_test_lru(folio)))