Message ID | 20250414072737.1698513-1-gavinguo@igalia.com (mailing list archive) |
---|---|
State | New |
Series | mm/huge_memory: fix dereferencing invalid pmd migration entry |
On 14 Apr 2025, at 3:27, Gavin Guo wrote:

> When migrating a THP, concurrent access to the PMD migration entry
> during a deferred split scan can lead to a page fault, as illustrated

It is an access violation, right? Because pmd_folio(*pmd_migration_entry)
does not return a folio address. Page fault made this sound like not
a big issue.

> below. To prevent this page fault, it is necessary to check the PMD
> migration entry and return early. In this context, there is no need to
> use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
> of the target folio. Since the PMD migration entry is locked, it cannot
> be served as the target.

You mean split_huge_pmd_address() locks the PMD page table, so that
page migration cannot proceed, or the THP is locked by migration,
so that it cannot be split? The sentence is a little confusing to me.

>
> BUG: unable to handle page fault for address: ffffea60001db008
> CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
> Call Trace:
> <TASK>
> try_to_migrate_one+0x28c/0x3730
> rmap_walk_anon+0x4f6/0x770
> unmap_folio+0x196/0x1f0
> split_huge_page_to_list_to_order+0x9f6/0x1560
> deferred_split_scan+0xac5/0x12a0
> shrinker_debugfs_scan_write+0x376/0x470
> full_proxy_write+0x15c/0x220
> vfs_write+0x2fc/0xcb0
> ksys_write+0x146/0x250
> do_syscall_64+0x6a/0x120
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The bug is found by syzkaller on an internal kernel, then confirmed on
> upstream.
>
> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
> Cc: stable@vger.kernel.org
> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
> ---
>  mm/huge_memory.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47682d1ab7..0cb9547dcff2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  			   pmd_t *pmd, bool freeze, struct folio *folio)
>  {
> +	bool pmd_migration = is_pmd_migration_entry(*pmd);
> +
>  	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>  	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> @@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  	 * require a folio to check the PMD against. Otherwise, there
>  	 * is a risk of replacing the wrong folio.
>  	 */
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> -	    is_pmd_migration_entry(*pmd)) {
> -		if (folio && folio != pmd_folio(*pmd))
> -			return;
> +	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
> +		if (folio) {
> +			/*
> +			 * Do not apply pmd_folio() to a migration entry; and
> +			 * folio lock guarantees that it must be of the wrong
> +			 * folio anyway.

Why does the folio lock imply it is a wrong folio?

> +			 */
> +			if (pmd_migration)
> +				return;
> +			if (folio != pmd_folio(*pmd))
> +				return;
> +		}

Why not just

if (folio && pmd_migration)
	return;

if (pmd_trans_huge() …) {
	…
}
?

Thanks.

Best Regards,
Yan, Zi
Hi Zi-Yan,

Thank you for the comment!

On 4/15/25 00:50, Zi Yan wrote:
> On 14 Apr 2025, at 3:27, Gavin Guo wrote:
>
>> When migrating a THP, concurrent access to the PMD migration entry
>> during a deferred split scan can lead to a page fault, as illustrated
>
> It is an access violation, right? Because pmd_folio(*pmd_migration_entry)
> does not return a folio address. Page fault made this sound like not
> a big issue.

Right, pmd_folio(*pmd_migration_entry) is defined with the following
macro:

#define pmd_folio(pmd) page_folio(pmd_page(pmd))

page_folio will eventually access compound_head:

static __always_inline unsigned long _compound_head(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);
	...
}

The invalid access address (ffffea60001db008) starts with 0xffffea, the
base of the vmemmap region where struct pages are stored. It indicates
that the page address calculated from pmd_page is invalid during the THP
migration, where the pmd migration entry has been set up in
set_pmd_migration_entry, for example:

do_mbind
  migrate_pages
    migrate_pages_sync
      migrate_pages_batch
        migrate_folio_unmap
          try_to_migrate
            rmap_walk_anon
              try_to_migrate_one
                set_pmd_migration_entry
                  set_pmd_at

>
>> below. To prevent this page fault, it is necessary to check the PMD
>> migration entry and return early. In this context, there is no need to
>> use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
>> of the target folio. Since the PMD migration entry is locked, it cannot
>> be served as the target.
>
> You mean split_huge_pmd_address() locks the PMD page table, so that
> page migration cannot proceed, or the THP is locked by migration,
> so that it cannot be split? The sentence is a little confusing to me.

During the THP migration, the THP folio is locked (folio_trylock). When
this THP folio is identified as partially mapped and inserted into the
deferred split queue, the system then scans the queue and attempts to
split the folio when memory is under pressure, or through the sysfs
debug interface. Since the migrated folio remains locked, folio_trylock
fails on it during deferred_split_scan. As a result, the folio passed to
split_huge_pmd_locked ends up being unequal to the folio referenced by
the pmd migration entry, indicating the pmd migration folio cannot be
the target for splitting, and we need to return early.

An alternative approach is similar to the following:

+	swp_entry_t entry;
+	struct folio *dst;

 	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
 	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+		if (folio) {
+			if (is_pmd_migration_entry(*pmd)) {
+				entry = pmd_to_swp_entry(*pmd);
+				dst = pfn_swap_entry_folio(entry);
+			} else
+				dst = pmd_folio(*pmd);
+
+			if (folio != dst)
+				return;
+		}
 		__split_huge_pmd_locked(vma, pmd, address, freeze);

However, this extra effort to translate the pmd migration folio is
unnecessary. Any ideas of exceptions?

>
>>
>> BUG: unable to handle page fault for address: ffffea60001db008
>> [... call trace and patch header snipped; quoted in full above ...]
>>
>> @@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>>  	 * require a folio to check the PMD against. Otherwise, there
>>  	 * is a risk of replacing the wrong folio.
>>  	 */
>> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
>> -	    is_pmd_migration_entry(*pmd)) {
>> -		if (folio && folio != pmd_folio(*pmd))
>> -			return;
>> +	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
>> +		if (folio) {
>> +			/*
>> +			 * Do not apply pmd_folio() to a migration entry; and
>> +			 * folio lock guarantees that it must be of the wrong
>> +			 * folio anyway.
>
> Why does the folio lock imply it is a wrong folio?

As explained above.

>
>> +			 */
>> +			if (pmd_migration)
>> +				return;
>> +			if (folio != pmd_folio(*pmd))
>> +				return;
>> +		}
>
> Why not just
>
> if (folio && pmd_migration)
> 	return;
>
> if (pmd_trans_huge() …) {
> 	…
> }
> ?

Do you mean to implement it as follows?

if (folio && pmd_migration)
	return;

if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
	if (folio && folio != pmd_folio(*pmd))
		return;
}

if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration)
	__split_huge_pmd_locked(vma, pmd, address, freeze);

To match the current upstream logic, when folio is NULL,
__split_huge_pmd_locked must always be executed, no matter whether the
condition is pmd_trans_huge, pmd_devmap, or pmd_migration. Given that,
__split_huge_pmd_locked cannot be put inside the if condition for
trans_huge and devmap. Likewise, __split_huge_pmd_locked cannot be
called directly without the wrapping if condition checks. What happens
if this is not a PMD entry? It would be invoked unconditionally.

The original implementation consolidates all the logic into one large
if statement with a nested if condition check. However, it's more
robust and clear.
The second one simplifies the structure, rules out pmd_migration
earlier, and avoids the big if condition. However, it's trickier and
needs extra care and maintenance.

>
> Thanks.
>
> Best Regards,
> Yan, Zi

Thanks!
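[Editorial sketch, not a message from the thread: the faulting address
is consistent with the explanation above. The arithmetic below assumes
the x86-64 vmemmap layout and a 64-byte struct page:]

/*
 * pmd_page(pmd)   = vmemmap + pmd_pfn(pmd) * sizeof(struct page)
 *                   -- a migration entry encodes swap type/offset,
 *                   -- not a real PFN, so pmd_pfn() yields garbage
 *                 = ffffea60001db000  (an unmapped vmemmap address)
 * page_folio(p)  -> READ_ONCE(p->compound_head)
 *                   -- compound_head is the second word of struct page,
 *                   -- hence the fault at ffffea60001db000 + 8
 *                 = ffffea60001db008
 */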
On 15 Apr 2025, at 6:07, Gavin Guo wrote:

> Hi Zi-Yan,
>
> Thank you for the comment!
>
> On 4/15/25 00:50, Zi Yan wrote:
>> On 14 Apr 2025, at 3:27, Gavin Guo wrote:
>>
>>> When migrating a THP, concurrent access to the PMD migration entry
>>> during a deferred split scan can lead to a page fault, as illustrated
>>
>> It is an access violation, right? Because pmd_folio(*pmd_migration_entry)
>> does not return a folio address. Page fault made this sound like not
>> a big issue.
>
> Right, pmd_folio(*pmd_migration_entry) is defined with the following
> macro:
>
> #define pmd_folio(pmd) page_folio(pmd_page(pmd))
>
> [... macro expansion and call chain snipped ...]

Exactly. Maybe I was not clear. I just want you to update your git
commit log to say “… can lead to an invalid address access” instead of
“page fault”.

>
>>> below. To prevent this page fault, it is necessary to check the PMD
>>> migration entry and return early. In this context, there is no need to
>>> use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
>>> of the target folio. Since the PMD migration entry is locked, it cannot
>>> be served as the target.
>>
>> You mean split_huge_pmd_address() locks the PMD page table, so that
>> page migration cannot proceed, or the THP is locked by migration,
>> so that it cannot be split? The sentence is a little confusing to me.
>
> During the THP migration, the THP folio is locked (folio_trylock). When
> this THP folio is identified as partially mapped and inserted into the
> deferred split queue, the system then scans the queue and attempts to
> split the folio when memory is under pressure, or through the sysfs
> debug interface. Since the migrated folio remains locked, folio_trylock
> fails on it during deferred_split_scan. As

Wait. If folio_trylock() failed, how could split_folio() be called in
deferred_split_scan()? Your call trace has split_folio(), which means
the folio is locked by deferred_split_scan() already, not migration.
What am I missing?

> a result, the folio passed to split_huge_pmd_locked ends up being
> unequal to the folio referenced by the pmd migration entry, indicating
> the pmd migration folio cannot be the target for splitting, and we need
> to return early.
>
> An alternative approach is similar to the following:
>
> [... snippet snipped; quoted in full above ...]
>
> However, this extra effort to translate the pmd migration folio is
> unnecessary. Any ideas of exceptions?

I get this part. Your assumption is that split_huge_pmd_address() with
folio passed in is only called by THP split. It will be better to state
your assumption explicitly in the comment. It is unclear to people why
“folio lock guarantees … wrong folio”.

>
>>> [... BUG trace and patch snipped; quoted in full above ...]
>>
>> Why does the folio lock imply it is a wrong folio?
>
> As explained above.
>
>> Why not just
>>
>> if (folio && pmd_migration)
>> 	return;
>>
>> if (pmd_trans_huge() …) {
>> 	…
>> }
>> ?
>
> Do you mean to implement it as follows?
>
> if (folio && pmd_migration)
> 	return;
>
> if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> 	if (folio && folio != pmd_folio(*pmd))
> 		return;
> }
>
> if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration)
> 	__split_huge_pmd_locked(vma, pmd, address, freeze);

Yes.

> To match the current upstream logic, when folio is NULL,
> __split_huge_pmd_locked must always be executed, no matter whether the
> condition is pmd_trans_huge, pmd_devmap, or pmd_migration. Given that,
> __split_huge_pmd_locked cannot be put inside the if condition for
> trans_huge and devmap. Likewise, __split_huge_pmd_locked cannot be
> called directly without the wrapping if condition checks. What happens
> if this is not a PMD entry? It would be invoked unconditionally.
>
> The original implementation consolidates all the logic into one large
> if statement with a nested if condition check. However, it's more
> robust and clear. The second one simplifies the structure, rules out
> pmd_migration earlier, and avoids the big if condition. However, it's
> trickier and needs extra care and maintenance.

IMHO, less indentation is always better. But I have no strong opinion
on it.

Anyway, we need to figure out why both THP migration and
deferred_split_scan() hold the THP lock first, which sounds impossible
to me. Or some other execution interleaving is happening.

Thanks.

Best Regards,
Yan, Zi
On 14.04.25 09:27, Gavin Guo wrote:

> When migrating a THP, concurrent access to the PMD migration entry
> during a deferred split scan can lead to a page fault, as illustrated
> below. To prevent this page fault, it is necessary to check the PMD
> migration entry and return early. In this context, there is no need to
> use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
> of the target folio. Since the PMD migration entry is locked, it cannot
> be served as the target.
>
> [... BUG trace and patch header snipped; quoted in full above ...]
>
> @@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  	 * require a folio to check the PMD against. Otherwise, there
>  	 * is a risk of replacing the wrong folio.
>  	 */
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> -	    is_pmd_migration_entry(*pmd)) {
> -		if (folio && folio != pmd_folio(*pmd))
> -			return;

Why not something like

struct folio *entry_folio;

if (folio) {
	if (is_pmd_migration_entry(*pmd))
		entry_folio = pfn_swap_entry_folio(pmd_to_swp_entry(*pmd));
	else
		entry_folio = pmd_folio(*pmd);

	if (folio != entry_folio)
		return;
}

(can likely be cleaned up a bit by moving stuff into a separate function)
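[Editorial sketch, not a message from the thread: the "separate
function" mentioned above might look like the following; the helper
name pmd_entry_folio() is hypothetical, while the callees are the
kernel helpers used in the snippet:]

/* Hypothetical helper: resolve the folio a huge PMD entry refers to,
 * whether it is a present PMD or a migration entry. */
static struct folio *pmd_entry_folio(pmd_t pmd)
{
	if (is_pmd_migration_entry(pmd))
		return pfn_swap_entry_folio(pmd_to_swp_entry(pmd));
	return pmd_folio(pmd);
}

The check in split_huge_pmd_locked() would then reduce to
"if (folio && folio != pmd_entry_folio(*pmd)) return;".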
On Mon, 14 Apr 2025, Gavin Guo wrote:

> When migrating a THP, concurrent access to the PMD migration entry
> during a deferred split scan can lead to a page fault, as illustrated
> below. To prevent this page fault, it is necessary to check the PMD
> migration entry and return early. In this context, there is no need to
> use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
> of the target folio. Since the PMD migration entry is locked, it cannot
> be served as the target.
>
> [... BUG trace snipped; quoted in full above ...]
>
> The bug is found by syzkaller on an internal kernel, then confirmed on
> upstream.
>
> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
> Cc: stable@vger.kernel.org
> Signed-off-by: Gavin Guo <gavinguo@igalia.com>

Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  mm/huge_memory.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47682d1ab7..0cb9547dcff2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  			   pmd_t *pmd, bool freeze, struct folio *folio)
>  {
> +	bool pmd_migration = is_pmd_migration_entry(*pmd);
> +
>  	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
>  	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>  	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
> @@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>  	 * require a folio to check the PMD against. Otherwise, there
>  	 * is a risk of replacing the wrong folio.
>  	 */
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> -	    is_pmd_migration_entry(*pmd)) {
> -		if (folio && folio != pmd_folio(*pmd))
> -			return;
> +	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
> +		if (folio) {
> +			/*
> +			 * Do not apply pmd_folio() to a migration entry; and
> +			 * folio lock guarantees that it must be of the wrong
> +			 * folio anyway.
> +			 */
> +			if (pmd_migration)
> +				return;
> +			if (folio != pmd_folio(*pmd))
> +				return;
> +		}
>  		__split_huge_pmd_locked(vma, pmd, address, freeze);
>  	}
>  }
>
> base-commit: a24588245776dafc227243a01bfbeb8a59bafba9
> --
> 2.43.0
On Mon, 14 Apr 2025, Zi Yan wrote:

> On 14 Apr 2025, at 3:27, Gavin Guo wrote:
>
> > When migrating a THP, concurrent access to the PMD migration entry
> > during a deferred split scan can lead to a page fault, as illustrated
>
> It is an access violation, right? Because pmd_folio(*pmd_migration_entry)
> does not return a folio address. Page fault made this sound like not
> a big issue.
>
> > below. To prevent this page fault, it is necessary to check the PMD
> > migration entry and return early. In this context, there is no need to
> > use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
> > of the target folio. Since the PMD migration entry is locked, it cannot
> > be served as the target.
>
> You mean split_huge_pmd_address() locks the PMD page table, so that
> page migration cannot proceed, or the THP is locked by migration,
> so that it cannot be split? The sentence is a little confusing to me.

No, split_huge_pmd_address() locks nothing. But its caller holds the
folio lock on this folio (as split_huge_pmd_locked() asserts with a
VM_WARN_ON_ONCE); and page migration holds folio lock on its folio (as
various swapops.h functions assert with BUG_ON). So any PMD migration
entry found here cannot be for the folio which split_huge_pmd_address()
is passing down. (And even if the impossible did occur, what would we
want to do? Skip it as the patch does.)

> > [... BUG trace and patch header snipped; quoted in full above ...]
> >
> > @@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> >  	 * require a folio to check the PMD against. Otherwise, there
> >  	 * is a risk of replacing the wrong folio.
> >  	 */
> > -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
> > -	    is_pmd_migration_entry(*pmd)) {
> > -		if (folio && folio != pmd_folio(*pmd))
> > -			return;
> > +	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
> > +		if (folio) {
> > +			/*
> > +			 * Do not apply pmd_folio() to a migration entry; and
> > +			 * folio lock guarantees that it must be of the wrong
> > +			 * folio anyway.
>
> Why does the folio lock imply it is a wrong folio?

Because you cannot have two tasks holding folio lock on the same folio
at the same time. Therefore it is a different ("wrong") folio.

> > +			 */
> > +			if (pmd_migration)
> > +				return;
> > +			if (folio != pmd_folio(*pmd))
> > +				return;
> > +		}
>
> Why not just
>
> if (folio && pmd_migration)
> 	return;

That looks nicer, less indentation, I agree. But Gavin's patch is
keeping the relevant check next to the "pmd_folio(*pmd)" to be avoided:
also good. I have no opinion on which is the better.

Hugh

> if (pmd_trans_huge() …) {
> 	…
> }
> ?
>
> Thanks.
>
> Best Regards,
> Yan, Zi
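[Editorial sketch, not a message from the thread: a self-contained
userspace model of the patched guard; the stub types and the bit-0
encoding of "migration entry" are illustrative assumptions, only the
control flow mirrors the patch:]

#include <stdbool.h>
#include <stddef.h>

struct folio { int id; };
typedef struct { unsigned long val; } pmd_t;

/* Bit 0 models "this PMD is a migration entry". */
static bool is_pmd_migration_entry(pmd_t pmd)
{
	return pmd.val & 1;
}

/* Models pmd_folio(): derives a folio address from the PMD value,
 * which is garbage for a migration entry, so the guard below must
 * never reach it in that case. */
static struct folio *pmd_folio(pmd_t pmd)
{
	return (struct folio *)(pmd.val & ~1UL);
}

/* Mirrors the patched control flow: with a locked folio in hand, a
 * migration entry found here can only belong to some other folio. */
static bool may_split(pmd_t pmd, struct folio *folio)
{
	if (folio) {
		if (is_pmd_migration_entry(pmd))
			return false;		/* the added early return */
		if (folio != pmd_folio(pmd))
			return false;		/* the pre-existing check */
	}
	return true;
}

int main(void)
{
	struct folio a = { .id = 1 };
	pmd_t migration_entry = { .val = 1 };

	/* Must not split: the entry belongs to some other folio. */
	return may_split(migration_entry, &a) ? 1 : 0;
}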
On Tue, 15 Apr 2025, Zi Yan wrote:
>
> Anyway, we need to figure out why both THP migration and deferred_split_scan()
> hold the THP lock first, which sounds impossible to me. Or some other execution
> interleaving is happening.

I think perhaps you're missing that an anon_vma lookup points to a
location which may contain the folio of interest, but might instead
contain another folio: and weeding out those other folios is precisely
what the "folio != pmd_folio(*pmd)" check (and the "risk of replacing
the wrong folio" comment a few lines above it) is for.

The "BUG: unable to handle page fault" comes about because that other
folio might actually be being migrated at this time, so we encounter a
PMD migration entry instead of a valid PMD entry. But if it's the folio
we're looking for, our folio lock excludes a racing migration, so it
would never be a PMD migration entry for our folio.

Hugh
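[Editorial sketch of the interleaving Hugh describes, not a message
from the thread; folios A and B and the CPU assignment are
illustrative:]

/*
 * Folio A sits on the deferred split queue; folio B, mapped in the
 * same anon_vma interval, is concurrently being migrated:
 *
 *   CPU 0: deferred_split_scan()          CPU 1: migrate_pages()
 *   folio_trylock(A)            // ok     folio_trylock(B)      // ok
 *   split_huge_page_to_list_to_order(A)   try_to_migrate(B)
 *     unmap_folio(A)                        set_pmd_migration_entry(B)
 *       rmap_walk_anon(A)                   // B's PMD is now a
 *         try_to_migrate_one(A, vma)        // migration entry
 *           split_huge_pmd_address(vma, ..., A)
 *             -> finds B's migration entry; pmd_folio(*pmd) would
 *                compute a bogus struct page address
 *
 * A's lock, held on CPU 0, guarantees the entry cannot be A's own:
 * migrating A would require that same lock.
 */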
On Wed, 16 Apr 2025, David Hildenbrand wrote:
>
> Why not something like
>
> struct folio *entry_folio;
>
> if (folio) {
> 	if (is_pmd_migration_entry(*pmd))
> 		entry_folio = pfn_swap_entry_folio(pmd_to_swp_entry(*pmd));
> 	else
> 		entry_folio = pmd_folio(*pmd);
>
> 	if (folio != entry_folio)
> 		return;
> }

My own preference is to not add unnecessary code:
if folio and pmd_migration entry, we're not interested in entry_folio.
But yes it could be written in lots of other ways.

Hugh
On 17.04.25 07:36, Hugh Dickins wrote:
> On Wed, 16 Apr 2025, David Hildenbrand wrote:
>>
>> Why not something like
>>
>> [... snippet snipped; quoted in full above ...]
>
> My own preference is to not add unnecessary code:
> if folio and pmd_migration entry, we're not interested in entry_folio.
> But yes it could be written in lots of other ways.

While I don't disagree about "not adding unnecessary code" in general,
in this particular case just looking the folio up properly might be the
better alternative to reasoning about locking rules with conditional
input parameters :)
On 17.04.25 09:18, David Hildenbrand wrote:
> On 17.04.25 07:36, Hugh Dickins wrote:
>> On Wed, 16 Apr 2025, David Hildenbrand wrote:
>>>
>>> Why not something like
>>>
>>> [... snippet snipped; quoted in full above ...]
>>
>> My own preference is to not add unnecessary code:
>> if folio and pmd_migration entry, we're not interested in entry_folio.
>> But yes it could be written in lots of other ways.
>
> While I don't disagree about "not adding unnecessary code" in general,
> in this particular case just looking the folio up properly might be the
> better alternative to reasoning about locking rules with conditional
> input parameters :)

FWIW, I was wondering if we can rework that code, letting the caller do
the checking and getting rid of the folio parameter. Something like this
(incomplete, just to discuss whether we could move the
TTU_SPLIT_HUGE_PMD handling):

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab77..754aa3103e8bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,22 +3075,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 		pmd_t *pmd, bool freeze, struct folio *folio)
 {
-	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
-	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
-	VM_BUG_ON(freeze && !folio);
 
-	/*
-	 * When the caller requests to set up a migration entry, we
-	 * require a folio to check the PMD against. Otherwise, there
-	 * is a risk of replacing the wrong folio.
-	 */
 	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+	    is_pmd_migration_entry(*pmd))
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
-	}
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/rmap.c b/mm/rmap.c
index 67bb273dfb80d..bf0320b03d615 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2291,13 +2291,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	if (flags & TTU_SYNC)
 		pvmw.flags = PVMW_SYNC;
 
-	/*
-	 * unmap_page() in mm/huge_memory.c is the only user of migration with
-	 * TTU_SPLIT_HUGE_PMD and it wants to freeze.
-	 */
-	if (flags & TTU_SPLIT_HUGE_PMD)
-		split_huge_pmd_address(vma, address, true, folio);
-
 	/*
 	 * For THP, we have to assume the worse case ie pmd for invalidation.
 	 * For hugetlb, it could be much worse if we need to do pud
@@ -2326,6 +2319,14 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	/* PMD-mapped THP migration entry */
 	if (!pvmw.pte) {
+		if (flags & TTU_SPLIT_HUGE_PMD) {
+			split_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
+					      true, NULL);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
 		subpage = folio_page(folio,
 				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
 		VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
On 17.04.25 10:07, David Hildenbrand wrote:
> [... previous message quoted in full, snipped ...]

Likely, we'd have to adjust the CONFIG_ARCH_ENABLE_THP_MIGRATION
coverage here, for TTU_SPLIT_HUGE_PMD to get handled even without that.

Just an idea.
On Thu, 17 Apr 2025, David Hildenbrand wrote:
> On 17.04.25 09:18, David Hildenbrand wrote:
>> On 17.04.25 07:36, Hugh Dickins wrote:
>>> [...]
>>>
>>> My own preference is to not add unnecessary code:
>>> if folio and pmd_migration entry, we're not interested in entry_folio.
>>> But yes it could be written in lots of other ways.
>>
>> While I don't disagree about "not adding unnecessary code" in general,
>> in this particular case just looking the folio up properly might be the
>> better alternative to reasoning about locking rules with conditional
>> input parameters :)
>
> FWIW, I was wondering if we can rework that code, letting the caller do
> the checking and getting rid of the folio parameter. Something like this
> (incomplete, just to discuss whether we could move the
> TTU_SPLIT_HUGE_PMD handling).

Yes, I too dislike the folio parameter used for a single case, and agree
it's better for the caller who chose pmd to check that *pmd fits the
folio.

I haven't checked your code below, but it looks like a much better way
to proceed, using the page_vma_mapped_walk() to get pmd lock and check;
and cutting out two or more layers of split_huge_pmd obscurity.

Way to go. However... what we want right now is a fix that can easily
go to stable: the rearrangements here in 6.15-rc mean, I think, that
whatever goes into the current tree will have to be placed differently
for stable, no seamless backports; but Gavin's patch (reworked if you
insist) can be adapted to stable (differently for different releases)
much more easily than the future direction you're proposing here.

(Hmm, that may be another reason for preferring the reasoning by
folio lock: forgive me if I'm misremembering, but didn't those
page migration swapops get renamed, some time around 5.11?)

Hugh

> [... diff snipped; quoted in full above ...]
>
> --
> Cheers,
>
> David / dhildenb
On 17.04.25 10:55, Hugh Dickins wrote:
> On Thu, 17 Apr 2025, David Hildenbrand wrote:
>> [...]
>
> Yes, I too dislike the folio parameter used for a single case, and agree
> it's better for the caller who chose pmd to check that *pmd fits the
> folio.
>
> I haven't checked your code below, but it looks like a much better way
> to proceed, using the page_vma_mapped_walk() to get pmd lock and check;
> and cutting out two or more layers of split_huge_pmd obscurity.
>
> Way to go. However... what we want right now is a fix that can easily
> go to stable: the rearrangements here in 6.15-rc mean, I think, that
> whatever goes into the current tree will have to be placed differently
> for stable, no seamless backports; but Gavin's patch (reworked if you
> insist) can be adapted to stable (differently for different releases)
> much more easily than the future direction you're proposing here.

I'm fine with going with the current patch and looking into cleaning it
up properly (if possible).

So for this patch

Acked-by: David Hildenbrand <david@redhat.com>

@Gavin, can you look into cleaning that up?

> (Hmm, that may be another reason for preferring the reasoning by
> folio lock: forgive me if I'm misremembering, but didn't those
> page migration swapops get renamed, some time around 5.11?)

I remember that we did something to PTE handling stuff in the context of
PTE markers. But things keep changing all of the time .. :)
On 4/17/25 17:04, David Hildenbrand wrote:
> [... earlier discussion snipped ...]
>
> I'm fine with going with the current patch and looking into cleaning it
> up properly (if possible).
>
> So for this patch
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> @Gavin, can you look into cleaning that up?

Thank you for your review. Before I begin the cleanup, could you please
confirm the following action items:

Zi Yan's suggestions for the patch are:
1. Replace "page fault" with "invalid address access" in the commit
   description.
2. Simplify the nested if-statements into a single if-statement to
   reduce indentation.

David, based on your comment, I understand that you are recommending the
entry_folio implementation. Also, from your discussion with Hugh, it
appears you agreed with my original approach of returning early when
encountering a PMD migration entry, thereby avoiding unnecessary checks.
Is that correct? If so, I will keep the current logic. Do you have any
additional cleanup suggestions?

I will start the cleanup work after confirmation.

>> (Hmm, that may be another reason for preferring the reasoning by
>> folio lock: forgive me if I'm misremembering, but didn't those
>> page migration swapops get renamed, some time around 5.11?)
>
> I remember that we did something to PTE handling stuff in the context
> of PTE markers. But things keep changing all of the time .. :)
On 17 Apr 2025, at 7:21, Gavin Guo wrote:

> [... earlier discussion snipped ...]
>
> Thank you for your review. Before I begin the cleanup, could you please
> confirm the following action items:
>
> Zi Yan's suggestions for the patch are:
> 1. Replace "page fault" with "invalid address access" in the commit
>    description.
>
> 2. Simplify the nested if-statements into a single if-statement to
>    reduce indentation.

3. Can you please add Hugh's explanation below in the commit log?
That clarifies the issue. Thank you for the fix.

“
an anon_vma lookup points to a
location which may contain the folio of interest, but might instead
contain another folio: and weeding out those other folios is precisely
what the "folio != pmd_folio(*pmd)" check (and the "risk of replacing
the wrong folio" comment a few lines above it) is for.
”

With that, Acked-by: Zi Yan <ziy@nvidia.com>

> David, based on your comment, I understand that you are recommending
> the entry_folio implementation. Also, from your discussion with Hugh,
> it appears you agreed with my original approach of returning early when
> encountering a PMD migration entry, thereby avoiding unnecessary
> checks. Is that correct? If so, I will keep the current logic. Do you
> have any additional cleanup suggestions?
>
> I will start the cleanup work after confirmation.

Best Regards,
Yan, Zi
On 17.04.25 13:21, Gavin Guo wrote:
> [... earlier discussion snipped ...]
>
> Thank you for your review. Before I begin the cleanup, could you please
> confirm the following action items:
>
> Zi Yan's suggestions for the patch are:
> 1. Replace "page fault" with "invalid address access" in the commit
>    description.

Yes, that makes sense.

> 2. Simplify the nested if-statements into a single if-statement to
>    reduce indentation.
>
> David, based on your comment, I understand that you are recommending the
> entry_folio implementation. Also, from your discussion with Hugh, it
> appears you agreed with my original approach of returning early when
> encountering a PMD migration entry, thereby avoiding unnecessary checks.
> Is that correct? If so, I will keep the current logic. Do you have any
> additional cleanup suggestions?

Yes, the current patch is okay for upstream+stable, but we should look
into cleaning that up. See the cleanup RFC patch I posted where we
remove the folio check completely.

Please let me know if you need more information.
On 4/17/25 19:32, Zi Yan wrote:
> [... earlier discussion snipped ...]
>
> 3. Can you please add Hugh's explanation below in the commit log?
> That clarifies the issue. Thank you for the fix.

Sure, will send out another patch for your review. Thank you for the
review.

> “
> an anon_vma lookup points to a
> location which may contain the folio of interest, but might instead
> contain another folio: and weeding out those other folios is precisely
> what the "folio != pmd_folio(*pmd)" check (and the "risk of replacing
> the wrong folio" comment a few lines above it) is for.
> ”
>
> With that, Acked-by: Zi Yan <ziy@nvidia.com>
On 4/17/25 19:36, David Hildenbrand wrote: > On 17.04.25 13:21, Gavin Guo wrote: >> On 4/17/25 17:04, David Hildenbrand wrote: >>> On 17.04.25 10:55, Hugh Dickins wrote: >>>> On Thu, 17 Apr 2025, David Hildenbrand wrote: >>>>> On 17.04.25 09:18, David Hildenbrand wrote: >>>>>> On 17.04.25 07:36, Hugh Dickins wrote: >>>>>>> On Wed, 16 Apr 2025, David Hildenbrand wrote: >>>>>>>> >>>>>>>> Why not something like >>>>>>>> >>>>>>>> struct folio *entry_folio; >>>>>>>> >>>>>>>> if (folio) { >>>>>>>> if (is_pmd_migration_entry(*pmd)) >>>>>>>> entry_folio = pfn_swap_entry_folio(pmd_to_swp_entry(*pmd))); >>>>>>>> else >>>>>>>> entry_folio = pmd_folio(*pmd)); >>>>>>>> >>>>>>>> if (folio != entry_folio) >>>>>>>> return; >>>>>>>> } >>>>>>> >>>>>>> My own preference is to not add unnecessary code: >>>>>>> if folio and pmd_migration entry, we're not interested in >>>>>>> entry_folio. >>>>>>> But yes it could be written in lots of other ways. >>>>>> >>>>>> While I don't disagree about "not adding unnecessary code" in >>>>>> general, >>>>>> in this particular case just looking the folio up properly might >>>>>> be the >>>>>> better alternative to reasoning about locking rules with conditional >>>>>> input parameters :) >>>>>> >>>>> >>>>> FWIW, I was wondering if we can rework that code, letting the caller >>>>> to the >>>>> checking and getting rid of the folio parameter. Something like this >>>>> (incomplete, just to >>>>> discuss if we could move the TTU_SPLIT_HUGE_PMD handling). >>>> >>>> Yes, I too dislike the folio parameter used for a single case, and >>>> agree >>>> it's better for the caller who chose pmd to check that *pmd fits the >>>> folio. >>>> >>>> I haven't checked your code below, but it looks like a much better way >>>> to proceed, using the page_vma_mapped_walk() to get pmd lock and check; >>>> and cutting out two or more layers of split_huge_pmd obscurity. >>>> >>>> Way to go. However... what we want right now is a fix that can easily >>>> go to stable: the rearrangements here in 6.15-rc mean, I think, that >>>> whatever goes into the current tree will have to be placed differently >>>> for stable, no seamless backports; but Gavin's patch (reworked if you >>>> insist) can be adapted to stable (differently for different releases) >>>> more more easily than the future direction you're proposing here. >>> >>> I'm fine with going with the current patch and looking into cleaning it >>> up properly (if possible). >>> >>> So for this patch >>> >>> Acked-by: David Hildenbrand <david@redhat.com> >>> >>> @Gavin, can you look into cleaning that up? >> >> Thank you for your review. Before I begin the cleanup, could you please >> confirm the following action items: >> >> Zi Yan's suggestions for the patch are: >> 1. Replace the page fault with an invalid address access in the commit >> description. > > Yes, that makes sense. > >> >> 2. Simplify the nested if-statements into a single if-statement to >> reduce indentation. >> >> David, based on your comment, I understand that you are recommending the >> entry_folio implementation. Also, from your discussion with Hugh, it >> appears you agreed with my original approach of returning early when >> encountering a PMD migration entry, thereby avoiding unnecessary checks. >> Is that correct? If so, I will keep the current logic. Do you have any >> additional cleanup suggestions? > > Yes, the current patch is okay for upstream+stable, but we should look > into cleaning that up. Ah, I understand. 
I'll clean up the current one and propose another patch based on your RFC patch. Thanks for your review! > > See the cleanup RFC patch I posted where we remove the folio check > completely. Please let me know if you need more information. >
On 17 Apr 2025, at 8:02, Gavin Guo wrote: > On 4/17/25 19:32, Zi Yan wrote: >> On 17 Apr 2025, at 7:21, Gavin Guo wrote: >> >>> On 4/17/25 17:04, David Hildenbrand wrote: >>>> On 17.04.25 10:55, Hugh Dickins wrote: >>>>> On Thu, 17 Apr 2025, David Hildenbrand wrote: >>>>>> On 17.04.25 09:18, David Hildenbrand wrote: >>>>>>> On 17.04.25 07:36, Hugh Dickins wrote: >>>>>>>> On Wed, 16 Apr 2025, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Why not something like >>>>>>>>> >>>>>>>>> struct folio *entry_folio; >>>>>>>>> >>>>>>>>> if (folio) { >>>>>>>>> if (is_pmd_migration_entry(*pmd)) >>>>>>>>> entry_folio = pfn_swap_entry_folio(pmd_to_swp_entry(*pmd))); >>>>>>>>> else >>>>>>>>> entry_folio = pmd_folio(*pmd)); >>>>>>>>> >>>>>>>>> if (folio != entry_folio) >>>>>>>>> return; >>>>>>>>> } >>>>>>>> >>>>>>>> My own preference is to not add unnecessary code: >>>>>>>> if folio and pmd_migration entry, we're not interested in entry_folio. >>>>>>>> But yes it could be written in lots of other ways. >>>>>>> >>>>>>> While I don't disagree about "not adding unnecessary code" in general, >>>>>>> in this particular case just looking the folio up properly might be the >>>>>>> better alternative to reasoning about locking rules with conditional >>>>>>> input parameters :) >>>>>>> >>>>>> >>>>>> FWIW, I was wondering if we can rework that code, letting the caller to the >>>>>> checking and getting rid of the folio parameter. Something like this >>>>>> (incomplete, just to >>>>>> discuss if we could move the TTU_SPLIT_HUGE_PMD handling). >>>>> >>>>> Yes, I too dislike the folio parameter used for a single case, and agree >>>>> it's better for the caller who chose pmd to check that *pmd fits the folio. >>>>> >>>>> I haven't checked your code below, but it looks like a much better way >>>>> to proceed, using the page_vma_mapped_walk() to get pmd lock and check; >>>>> and cutting out two or more layers of split_huge_pmd obscurity. >>>>> >>>>> Way to go. However... what we want right now is a fix that can easily >>>>> go to stable: the rearrangements here in 6.15-rc mean, I think, that >>>>> whatever goes into the current tree will have to be placed differently >>>>> for stable, no seamless backports; but Gavin's patch (reworked if you >>>>> insist) can be adapted to stable (differently for different releases) >>>>> more more easily than the future direction you're proposing here. >>>> >>>> I'm fine with going with the current patch and looking into cleaning it up properly (if possible). >>>> >>>> So for this patch >>>> >>>> Acked-by: David Hildenbrand <david@redhat.com> >>>> >>>> @Gavin, can you look into cleaning that up? >>> >>> Thank you for your review. Before I begin the cleanup, could you please >>> confirm the following action items: >>> >>> Zi Yan's suggestions for the patch are: >>> 1. Replace the page fault with an invalid address access in the commit >>> description. >>> >>> 2. Simplify the nested if-statements into a single if-statement to >>> reduce indentation. >> >> 3. Can you please add Huge’s explanation below in the commit log? >> That clarifies the issue. Thank you for the fix. > > Sure, will send out another patch for your review. Thank you for the review. > Thanks. Do you mind sharing the syzkaller reproducer if that is possible and easy? I am trying to understand more about the issue. 
>> >> “ >> an anon_vma lookup points to a >> location which may contain the folio of interest, but might instead >> contain another folio: and weeding out those other folios is precisely >> what the "folio != pmd_folio((*pmd)" check (and the "risk of replacing >> the wrong folio" comment a few lines above it) is for. >> ” >> >> With that, Acked-by: Zi Yan <ziy@nvidia.com> >> >>> >>> David, based on your comment, I understand that you are recommending the >>> entry_folio implementation. Also, from your discussion with Hugh, it >>> appears you agreed with my original approach of returning early when >>> encountering a PMD migration entry, thereby avoiding unnecessary checks. >>> Is that correct? If so, I will keep the current logic. Do you have any >>> additional cleanup suggestions? >>> >>> I will start the cleanup work after confirmation. >>> >>>> >>>>> >>>>> (Hmm, that may be another reason for preferring the reasoning by >>>>> folio lock: forgive me if I'm misremembering, but didn't those >>>>> page migration swapops get renamed, some time around 5.11?) >>>> >>>> I remember that we did something to PTE handling stuff in the context of PTE markers. But things keep changing all of the time .. :) >>>> >> >> >> Best Regards, >> Yan, Zi Best Regards, Yan, Zi
On 4/17/25 20:10, Zi Yan wrote: > On 17 Apr 2025, at 8:02, Gavin Guo wrote: > >> On 4/17/25 19:32, Zi Yan wrote: >>> On 17 Apr 2025, at 7:21, Gavin Guo wrote: >>> >>>> On 4/17/25 17:04, David Hildenbrand wrote: >>>>> On 17.04.25 10:55, Hugh Dickins wrote: >>>>>> On Thu, 17 Apr 2025, David Hildenbrand wrote: >>>>>>> On 17.04.25 09:18, David Hildenbrand wrote: >>>>>>>> On 17.04.25 07:36, Hugh Dickins wrote: >>>>>>>>> On Wed, 16 Apr 2025, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> Why not something like >>>>>>>>>> >>>>>>>>>> struct folio *entry_folio; >>>>>>>>>> >>>>>>>>>> if (folio) { >>>>>>>>>> if (is_pmd_migration_entry(*pmd)) >>>>>>>>>> entry_folio = pfn_swap_entry_folio(pmd_to_swp_entry(*pmd))); >>>>>>>>>> else >>>>>>>>>> entry_folio = pmd_folio(*pmd)); >>>>>>>>>> >>>>>>>>>> if (folio != entry_folio) >>>>>>>>>> return; >>>>>>>>>> } >>>>>>>>> >>>>>>>>> My own preference is to not add unnecessary code: >>>>>>>>> if folio and pmd_migration entry, we're not interested in entry_folio. >>>>>>>>> But yes it could be written in lots of other ways. >>>>>>>> >>>>>>>> While I don't disagree about "not adding unnecessary code" in general, >>>>>>>> in this particular case just looking the folio up properly might be the >>>>>>>> better alternative to reasoning about locking rules with conditional >>>>>>>> input parameters :) >>>>>>>> >>>>>>> >>>>>>> FWIW, I was wondering if we can rework that code, letting the caller to the >>>>>>> checking and getting rid of the folio parameter. Something like this >>>>>>> (incomplete, just to >>>>>>> discuss if we could move the TTU_SPLIT_HUGE_PMD handling). >>>>>> >>>>>> Yes, I too dislike the folio parameter used for a single case, and agree >>>>>> it's better for the caller who chose pmd to check that *pmd fits the folio. >>>>>> >>>>>> I haven't checked your code below, but it looks like a much better way >>>>>> to proceed, using the page_vma_mapped_walk() to get pmd lock and check; >>>>>> and cutting out two or more layers of split_huge_pmd obscurity. >>>>>> >>>>>> Way to go. However... what we want right now is a fix that can easily >>>>>> go to stable: the rearrangements here in 6.15-rc mean, I think, that >>>>>> whatever goes into the current tree will have to be placed differently >>>>>> for stable, no seamless backports; but Gavin's patch (reworked if you >>>>>> insist) can be adapted to stable (differently for different releases) >>>>>> more more easily than the future direction you're proposing here. >>>>> >>>>> I'm fine with going with the current patch and looking into cleaning it up properly (if possible). >>>>> >>>>> So for this patch >>>>> >>>>> Acked-by: David Hildenbrand <david@redhat.com> >>>>> >>>>> @Gavin, can you look into cleaning that up? >>>> >>>> Thank you for your review. Before I begin the cleanup, could you please >>>> confirm the following action items: >>>> >>>> Zi Yan's suggestions for the patch are: >>>> 1. Replace the page fault with an invalid address access in the commit >>>> description. >>>> >>>> 2. Simplify the nested if-statements into a single if-statement to >>>> reduce indentation. >>> >>> 3. Can you please add Huge’s explanation below in the commit log? >>> That clarifies the issue. Thank you for the fix. >> >> Sure, will send out another patch for your review. Thank you for the review. >> > Thanks. Do you mind sharing the syzkaller reproducer if that is > possible and easy? I am trying to understand more about the issue. 
Sure, this is the reproducer:
https://drive.google.com/file/d/1eDBV6VfIzyqD9SeYGQBah-BJXO32Piy8/view

Reproducing steps:

1). gcc -o repro -lpthread -static ./repro.c
2). ./repro
3). Find the group number and replace 2539 in the following:
    sudo cat /sys/kernel/debug/shrinker/thp-deferred_split-12/count
4). Run the following command in multiple sessions:
    for i in $(seq 10000); do echo "2539 0 100" | sudo tee /sys/kernel/debug/shrinker/thp-deferred_split-12/scan ; done

Generally, the bug will be triggered within 5 minutes.

> >>> >>> “ >>> an anon_vma lookup points to a >>> location which may contain the folio of interest, but might instead >>> contain another folio: and weeding out those other folios is precisely >>> what the "folio != pmd_folio((*pmd)" check (and the "risk of replacing >>> the wrong folio" comment a few lines above it) is for. >>> ” >>> >>> With that, Acked-by: Zi Yan <ziy@nvidia.com> >>> >>>> >>>> David, based on your comment, I understand that you are recommending the >>>> entry_folio implementation. Also, from your discussion with Hugh, it >>>> appears you agreed with my original approach of returning early when >>>> encountering a PMD migration entry, thereby avoiding unnecessary checks. >>>> Is that correct? If so, I will keep the current logic. Do you have any >>>> additional cleanup suggestions? >>>> >>>> I will start the cleanup work after confirmation. >>>> >>>>> >>>>>> >>>>>> (Hmm, that may be another reason for preferring the reasoning by >>>>>> folio lock: forgive me if I'm misremembering, but didn't those >>>>>> page migration swapops get renamed, some time around 5.11?) >>>>> >>>>> I remember that we did something to PTE handling stuff in the context of PTE markers. But things keep changing all of the time .. :) >>>>> >>> >>> >>> Best Regards, >>> Yan, Zi > > > Best Regards, > Yan, Zi
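For convenience, the reproducing steps in the message above could be wrapped in a small script along these lines. This is only a sketch: the shrinker directory name and the group number are machine-specific and must be taken from step 3, and the report ran the scan loop in several sessions in parallel.

    #!/bin/sh
    # Hypothetical wrapper around the reproducing steps above.
    SHRINKER=/sys/kernel/debug/shrinker/thp-deferred_split-12
    GROUP=2539   # replace with the number found via "cat $SHRINKER/count"

    gcc -o repro -lpthread -static ./repro.c
    ./repro &

    # Keep poking the deferred split shrinker; run several of these loops
    # concurrently to widen the race window.
    for i in $(seq 10000); do
            echo "$GROUP 0 100" | sudo tee "$SHRINKER/scan" > /dev/null
    done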
On 17 Apr 2025, at 1:29, Hugh Dickins wrote: > On Tue, 15 Apr 2025, Zi Yan wrote: >> >> Anyway, we need to figure out why both THP migration and deferred_split_scan() >> hold the THP lock first, which sounds impossible to me. Or some other execution >> interleaving is happening. > > I think perhaps you're missing that an anon_vma lookup points to a > location which may contain the folio of interest, but might instead > contain another folio: and weeding out those other folios is precisely > what the "folio != pmd_folio((*pmd)" check (and the "risk of replacing > the wrong folio" comment a few lines above it) is for. Yes, from Gavin’s commit log, I thought both migration and deferred split were working on the same folio. But after rereading it along with your explanation, I now understand that both are working on the same pmd migration entry. Thank you for the explanation. > > The "BUG: unable to handle page fault" comes about because that other > folio might actually be being migrated at this time, so we encounter > a PMD migration entry instead of a valid PMD entry. But if it's the > folio we're looking for, our folio lock excludes a racing migration, > so it would never be a PMD migration entry for our folio. > > Hugh Best Regards, Yan, Zi
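Put schematically, the interleaving described in the exchange above looks roughly like this. This is a reconstruction from Hugh's explanation and the call traces, with two hypothetical folios A and B mapped near each other in the same anon_vma:

    deferred split of folio A                 migration of folio B
    -------------------------                 --------------------
    folio_trylock(A) succeeds                 folio_lock(B)
    unmap_folio(A)                            try_to_migrate(B)
      rmap_walk_anon(A)                         set_pmd_migration_entry(B)
        visits an address where B,
        not A, is mapped: *pmd is
        now B's migration entry
        split_huge_pmd_locked(.., folio=A)
          pmd_folio(*pmd) computes a
          bogus struct page address
          and faults

Because A's folio lock excludes a racing migration of A itself, a PMD migration entry found here can only belong to some other folio, which is why the fix can return early without translating the entry.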
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..0cb9547dcff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze, struct folio *folio)
 {
+	bool pmd_migration = is_pmd_migration_entry(*pmd);
+
 	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 	 * require a folio to check the PMD against. Otherwise, there
 	 * is a risk of replacing the wrong folio.
 	 */
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
+		if (folio) {
+			/*
+			 * Do not apply pmd_folio() to a migration entry; and
+			 * folio lock guarantees that it must be of the wrong
+			 * folio anyway.
+			 */
+			if (pmd_migration)
+				return;
+			if (folio != pmd_folio(*pmd))
+				return;
+		}
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 	}
 }
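As for cleanup item 2 discussed above (flattening the nested if-statements), the early return could be hoisted out of the outer check, along these lines. A sketch of the discussed simplification only, not necessarily what a v2 would merge:

    	bool pmd_migration = is_pmd_migration_entry(*pmd);

    	/*
    	 * A migration entry can never map the locked folio we were given
    	 * (its folio lock excludes a racing migration), so bail out before
    	 * pmd_folio() could ever be applied to a migration entry.
    	 */
    	if (folio && pmd_migration)
    		return;

    	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
    		if (folio && folio != pmd_folio(*pmd))
    			return;
    		__split_huge_pmd_locked(vma, pmd, address, freeze);
    	}

This behaves identically to the patch above: pmd_folio(*pmd) is reached only when *pmd is a present huge or devmap entry, never a migration entry.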
When migrating a THP, concurrent access to the PMD migration entry
during a deferred split scan can lead to a page fault, as illustrated
below. To prevent this page fault, it is necessary to check the PMD
migration entry and return early. In this context, there is no need to
use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality
of the target folio. Since the PMD migration entry is locked, it cannot
serve as the target.

BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
try_to_migrate_one+0x28c/0x3730
rmap_walk_anon+0x4f6/0x770
unmap_folio+0x196/0x1f0
split_huge_page_to_list_to_order+0x9f6/0x1560
deferred_split_scan+0xac5/0x12a0
shrinker_debugfs_scan_write+0x376/0x470
full_proxy_write+0x15c/0x220
vfs_write+0x2fc/0xcb0
ksys_write+0x146/0x250
do_syscall_64+0x6a/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e

The bug was found by syzkaller on an internal kernel, then confirmed on
upstream.

Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Cc: stable@vger.kernel.org
Signed-off-by: Gavin Guo <gavinguo@igalia.com>
---
mm/huge_memory.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

base-commit: a24588245776dafc227243a01bfbeb8a59bafba9