Message ID | 20240322193304.522496-1-zi.yan@sent.com (mailing list archive) |
---|---|
State | New |
Series | [v5] mm/migrate: split source folio if it is on deferred split list |
On 2024/3/23 03:33, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> If the source folio is on deferred split list, it is likely some subpages
> are not used. Split it before migration to avoid migrating unused subpages.
>
> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
> did not check if a THP is on deferred split list before migration, thus,
> the destination THP is never put on deferred split list even if the source
> THP might be. The opportunity of reclaiming free pages in a partially
> mapped THP during deferred list scanning is lost, but no other harmful
> consequence is present[1].
>
> From v4:
> 1. Simplify _deferred_list check without locking and do not count as
>    migration failures. (per Matthew Wilcox)
>
> From v3:
> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>    compilation error (per SeongJae Park).
>
> From v2:
> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>
> From v1:
> 1. Used dst to get correct deferred split list after migration
>    (per Ryan Roberts).
>
> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/migrate.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ab9856f5931b..6bd9319624a3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>
>  			cond_resched();
>
> +			/*
> +			 * The rare folio on the deferred split list should
> +			 * be split now. It should not count as a failure.
> +			 * Only check it without removing it from the list.
> +			 * Since the folio can be on deferred_split_scan()
> +			 * local list and removing it can cause the local list
> +			 * corruption. Folio split process below can handle it
> +			 * with the help of folio_ref_freeze().
> +			 *
> +			 * nr_pages > 2 is needed to avoid checking order-1
> +			 * page cache folios. They exist, in contrast to
> +			 * non-existent order-1 anonymous folios, and do not
> +			 * use _deferred_list.
> +			 */
> +			if (nr_pages > 2 &&
> +			   !list_empty(&folio->_deferred_list)) {
> +				if (try_split_folio(folio, from) == 0) {

IMO, we should move the split folios into the 'split_folios' list instead
of the 'from' list, otherwise there might be unhandled folios remaining in
the from list.

> +					stats->nr_thp_split += is_thp;
> +					stats->nr_split++;
> +					continue;
> +				}
> +			}
> +
>  			/*
>  			 * Large folio migration might be unsupported or
>  			 * the allocation might be failed so we should retry
>
> base-commit: 08a487ab26d541a3bd0adaee144f684b724d233b
On 26 Mar 2024, at 2:19, Baolin Wang wrote:

> On 2024/3/23 03:33, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> If the source folio is on deferred split list, it is likely some subpages
>> are not used. Split it before migration to avoid migrating unused subpages.
>>
>> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>> did not check if a THP is on deferred split list before migration, thus,
>> the destination THP is never put on deferred split list even if the source
>> THP might be. The opportunity of reclaiming free pages in a partially
>> mapped THP during deferred list scanning is lost, but no other harmful
>> consequence is present[1].
>>
>> From v4:
>> 1. Simplify _deferred_list check without locking and do not count as
>>    migration failures. (per Matthew Wilcox)
>>
>> From v3:
>> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>>    compilation error (per SeongJae Park).
>>
>> From v2:
>> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>>
>> From v1:
>> 1. Used dst to get correct deferred split list after migration
>>    (per Ryan Roberts).
>>
>> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
>> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>>
>> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/migrate.c | 23 +++++++++++++++++++++++
>>  1 file changed, 23 insertions(+)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index ab9856f5931b..6bd9319624a3 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>>  			cond_resched();
>> +			/*
>> +			 * The rare folio on the deferred split list should
>> +			 * be split now. It should not count as a failure.
>> +			 * Only check it without removing it from the list.
>> +			 * Since the folio can be on deferred_split_scan()
>> +			 * local list and removing it can cause the local list
>> +			 * corruption. Folio split process below can handle it
>> +			 * with the help of folio_ref_freeze().
>> +			 *
>> +			 * nr_pages > 2 is needed to avoid checking order-1
>> +			 * page cache folios. They exist, in contrast to
>> +			 * non-existent order-1 anonymous folios, and do not
>> +			 * use _deferred_list.
>> +			 */
>> +			if (nr_pages > 2 &&
>> +			   !list_empty(&folio->_deferred_list)) {
>> +				if (try_split_folio(folio, from) == 0) {
>
> IMO, we should move the split folios into the 'split_folios' list instead
> of the 'from' list, otherwise there might be unhandled folios remaining in
> the from list.

Can you elaborate on the actual situation you are thinking about? Thanks.

>
>> +					stats->nr_thp_split += is_thp;
>> +					stats->nr_split++;
>> +					continue;
>> +				}
>> +			}
>> +
>>  			/*
>>  			 * Large folio migration might be unsupported or
>>  			 * the allocation might be failed so we should retry
>>
>> base-commit: 08a487ab26d541a3bd0adaee144f684b724d233b

--
Best Regards,
Yan, Zi
On 2024/3/26 21:26, Zi Yan wrote:
> On 26 Mar 2024, at 2:19, Baolin Wang wrote:
>
>> On 2024/3/23 03:33, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> If the source folio is on deferred split list, it is likely some subpages
>>> are not used. Split it before migration to avoid migrating unused subpages.
>>>
>>> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>> did not check if a THP is on deferred split list before migration, thus,
>>> the destination THP is never put on deferred split list even if the source
>>> THP might be. The opportunity of reclaiming free pages in a partially
>>> mapped THP during deferred list scanning is lost, but no other harmful
>>> consequence is present[1].
>>>
>>> From v4:
>>> 1. Simplify _deferred_list check without locking and do not count as
>>>    migration failures. (per Matthew Wilcox)
>>>
>>> From v3:
>>> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>>>    compilation error (per SeongJae Park).
>>>
>>> From v2:
>>> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>>>
>>> From v1:
>>> 1. Used dst to get correct deferred split list after migration
>>>    (per Ryan Roberts).
>>>
>>> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
>>> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>>>
>>> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/migrate.c | 23 +++++++++++++++++++++++
>>>  1 file changed, 23 insertions(+)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index ab9856f5931b..6bd9319624a3 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>>>  			cond_resched();
>>> +			/*
>>> +			 * The rare folio on the deferred split list should
>>> +			 * be split now. It should not count as a failure.
>>> +			 * Only check it without removing it from the list.
>>> +			 * Since the folio can be on deferred_split_scan()
>>> +			 * local list and removing it can cause the local list
>>> +			 * corruption. Folio split process below can handle it
>>> +			 * with the help of folio_ref_freeze().
>>> +			 *
>>> +			 * nr_pages > 2 is needed to avoid checking order-1
>>> +			 * page cache folios. They exist, in contrast to
>>> +			 * non-existent order-1 anonymous folios, and do not
>>> +			 * use _deferred_list.
>>> +			 */
>>> +			if (nr_pages > 2 &&
>>> +			   !list_empty(&folio->_deferred_list)) {
>>> +				if (try_split_folio(folio, from) == 0) {
>>
>> IMO, we should move the split folios into the 'split_folios' list instead
>> of the 'from' list, otherwise there might be unhandled folios remaining in
>> the from list.
>
> Can you elaborate on the actual situation you are thinking about? Thanks.

Sure.

Suppose there is only one large folio in the from list that needs to be
migrated, and this large folio is in the _deferred_list, which means it
needs to be split. Your patch will re-add the split base pages back into
the 'from' list. However, please see the list_for_each_entry_safe macro:

#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_first_entry(head, typeof(*pos), member),	\
		n = list_next_entry(pos, member);			\
	     !list_entry_is_head(pos, head, member);			\
	     pos = n, n = list_next_entry(n, member))

It will terminate the iteration early because the next entry 'n' taken out
in advance is already the head, leading to the remaining split base pages
still in the from list. This can cause the following crash when I did some
migration testing:

[ 412.576943] ------------[ cut here ]------------
[ 412.576947] kernel BUG at mm/migrate.c:2634!
[ 412.577132] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 412.577201] CPU: 59 PID: 9581 Comm: numa01 Kdump: loaded Tainted: G E 6.9.0-rc1+ #69
........
[ 412.578651] Call Trace:
[ 412.578692] <TASK>
[ 412.578730] ? die+0x33/0x90
[ 412.578770] ? do_trap+0xdf/0x110
[ 412.578815] ? migrate_misplaced_folio+0x1f2/0x2b0
[ 412.578875] ? do_error_trap+0x65/0x80
[ 412.578922] ? migrate_misplaced_folio+0x1f2/0x2b0
[ 412.578977] ? exc_invalid_op+0x4e/0x70
[ 412.579048] ? migrate_misplaced_folio+0x1f2/0x2b0
[ 412.579131] ? asm_exc_invalid_op+0x16/0x20
[ 412.579182] ? migrate_misplaced_folio+0x1f2/0x2b0
[ 412.579255] do_numa_page+0x205/0x5b0
[ 412.579305] __handle_mm_fault+0x2b0/0x6c0
[ 412.579354] handle_mm_fault+0x105/0x270
[ 412.579404] do_user_addr_fault+0x214/0x6b0
[ 412.579453] exc_page_fault+0x64/0x140
[ 412.579509] asm_exc_page_fault+0x22/0x30

2583 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
2584                             int node)
2585 {
......
2628         if (nr_succeeded) {
2629                 count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
2630                 if (!node_is_toptier(folio_nid(folio)) && node_is_toptier(node))
2631                         mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
2632                                             nr_succeeded);
2633         }
2634         BUG_ON(!list_empty(&migratepages));
2635         return isolated;
2636
2637 out:

After changing as below, the system crash issue is gone.

+++ b/mm/migrate.c
@@ -1668,7 +1668,7 @@ static int migrate_pages_batch(struct list_head *from,
 			 */
 			if (nr_pages > 2 &&
 			   !list_empty(&folio->_deferred_list)) {
-				if (try_split_folio(folio, from) == 0) {
+				if (try_split_folio(folio, split_folios) == 0) {
 					stats->nr_thp_split += is_thp;
 					stats->nr_split++;
 					continue;
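The early-termination argument above can be reproduced outside the kernel. The following is a minimal userspace sketch, not kernel code: the `toy_*` names, `struct toy_folio`, and the 4-page folio size are all hypothetical, and the loop only mimics the shape of list_for_each_entry_safe() (the next pointer `n` is fetched before the body runs). It also assumes that folios placed on a separate split list are drained by a later pass, which the toy models with a second call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy circular doubly linked list, shaped like the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

/* Hypothetical stand-in for a folio: list linkage plus a page count. */
struct toy_folio {
	struct list_head lru;
	int nr_pages;
};

#define TOY_LARGE_PAGES 4	/* illustrative size, not a kernel constant */

static void toy_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void toy_list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void toy_list_del(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
}

static int toy_list_len(const struct list_head *head)
{
	const struct list_head *p;
	int len = 0;

	for (p = head->next; p != head; p = p->next)
		len++;
	return len;
}

/*
 * Walk 'list' with the same shape as list_for_each_entry_safe(): 'n' is
 * read before the body runs. A base page is "migrated" (just unlinked);
 * a large folio is unlinked and its base pages are queued at the tail of
 * 'split_target'.
 */
static void toy_migrate_batch(struct list_head *list,
			      struct list_head *split_target,
			      struct toy_folio *base_pages)
{
	struct list_head *pos, *n;

	for (pos = list->next, n = pos->next; pos != list;
	     pos = n, n = n->next) {
		struct toy_folio *folio = (struct toy_folio *)
			((char *)pos - offsetof(struct toy_folio, lru));
		int i;

		toy_list_del(pos);
		if (folio->nr_pages == 1)
			continue;	/* base page: treated as migrated */

		for (i = 0; i < folio->nr_pages; i++) {
			base_pages[i].nr_pages = 1;
			toy_list_add_tail(&base_pages[i].lru, split_target);
		}
	}
}

/* Returns how many entries are still on 'from' after the toy migration. */
int toy_leftover(bool requeue_on_from)
{
	struct list_head from, split_folios;
	struct toy_folio large = { .nr_pages = TOY_LARGE_PAGES };
	struct toy_folio base_pages[TOY_LARGE_PAGES];

	toy_list_init(&from);
	toy_list_init(&split_folios);
	toy_list_add_tail(&large.lru, &from);	/* the only entry on 'from' */

	toy_migrate_batch(&from, requeue_on_from ? &from : &split_folios,
			  base_pages);
	/* Assumed later pass that drains the separate split list. */
	toy_migrate_batch(&split_folios, &split_folios, base_pages);

	return toy_list_len(&from);
}
```

In this toy, `toy_leftover(true)` returns 4: when the single large folio is visited, `n` already points at the list head, so the base pages appended to `from` are never visited. `toy_leftover(false)` returns 0 because the base pages go to the separate list and are handled by the second pass, which mirrors the effect of passing `split_folios` instead of `from`.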
On 26 Mar 2024, at 10:42, Baolin Wang wrote:

> On 2024/3/26 21:26, Zi Yan wrote:
>> On 26 Mar 2024, at 2:19, Baolin Wang wrote:
>>
>>> On 2024/3/23 03:33, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> If the source folio is on deferred split list, it is likely some subpages
>>>> are not used. Split it before migration to avoid migrating unused subpages.
>>>>
>>>> Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>> did not check if a THP is on deferred split list before migration, thus,
>>>> the destination THP is never put on deferred split list even if the source
>>>> THP might be. The opportunity of reclaiming free pages in a partially
>>>> mapped THP during deferred list scanning is lost, but no other harmful
>>>> consequence is present[1].
>>>>
>>>> From v4:
>>>> 1. Simplify _deferred_list check without locking and do not count as
>>>>    migration failures. (per Matthew Wilcox)
>>>>
>>>> From v3:
>>>> 1. Guarded deferred list code behind CONFIG_TRANSPARENT_HUGEPAGE to avoid
>>>>    compilation error (per SeongJae Park).
>>>>
>>>> From v2:
>>>> 1. Split the source folio instead of migrating it (per Matthew Wilcox)[2].
>>>>
>>>> From v1:
>>>> 1. Used dst to get correct deferred split list after migration
>>>>    (per Ryan Roberts).
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
>>>> [2]: https://lore.kernel.org/linux-mm/Ze_P6xagdTbcu1Kz@casper.infradead.org/
>>>>
>>>> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>>  mm/migrate.c | 23 +++++++++++++++++++++++
>>>>  1 file changed, 23 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index ab9856f5931b..6bd9319624a3 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
>>>>  			cond_resched();
>>>> +			/*
>>>> +			 * The rare folio on the deferred split list should
>>>> +			 * be split now. It should not count as a failure.
>>>> +			 * Only check it without removing it from the list.
>>>> +			 * Since the folio can be on deferred_split_scan()
>>>> +			 * local list and removing it can cause the local list
>>>> +			 * corruption. Folio split process below can handle it
>>>> +			 * with the help of folio_ref_freeze().
>>>> +			 *
>>>> +			 * nr_pages > 2 is needed to avoid checking order-1
>>>> +			 * page cache folios. They exist, in contrast to
>>>> +			 * non-existent order-1 anonymous folios, and do not
>>>> +			 * use _deferred_list.
>>>> +			 */
>>>> +			if (nr_pages > 2 &&
>>>> +			   !list_empty(&folio->_deferred_list)) {
>>>> +				if (try_split_folio(folio, from) == 0) {
>>>
>>> IMO, we should move the split folios into the 'split_folios' list instead
>>> of the 'from' list, otherwise there might be unhandled folios remaining in
>>> the from list.
>>
>> Can you elaborate on the actual situation you are thinking about? Thanks.
>
> Sure.
>
> Suppose there is only one large folio in the from list that needs to be
> migrated, and this large folio is in the _deferred_list, which means it
> needs to be split. Your patch will re-add the split base pages back into
> the 'from' list. However, please see the list_for_each_entry_safe macro:
>
> #define list_for_each_entry_safe(pos, n, head, member)			\
> 	for (pos = list_first_entry(head, typeof(*pos), member),	\
> 		n = list_next_entry(pos, member);			\
> 	     !list_entry_is_head(pos, head, member);			\
> 	     pos = n, n = list_next_entry(n, member))
>
> It will terminate the iteration early because the next entry 'n' taken out
> in advance is already the head, leading to the remaining split base pages
> still in the from list. This can cause the following crash when I did some
> migration testing:
>
> [ 412.576943] ------------[ cut here ]------------
> [ 412.576947] kernel BUG at mm/migrate.c:2634!
> [ 412.577132] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 412.577201] CPU: 59 PID: 9581 Comm: numa01 Kdump: loaded Tainted: G E 6.9.0-rc1+ #69
> ........
> [ 412.578651] Call Trace:
> [ 412.578692] <TASK>
> [ 412.578730] ? die+0x33/0x90
> [ 412.578770] ? do_trap+0xdf/0x110
> [ 412.578815] ? migrate_misplaced_folio+0x1f2/0x2b0
> [ 412.578875] ? do_error_trap+0x65/0x80
> [ 412.578922] ? migrate_misplaced_folio+0x1f2/0x2b0
> [ 412.578977] ? exc_invalid_op+0x4e/0x70
> [ 412.579048] ? migrate_misplaced_folio+0x1f2/0x2b0
> [ 412.579131] ? asm_exc_invalid_op+0x16/0x20
> [ 412.579182] ? migrate_misplaced_folio+0x1f2/0x2b0
> [ 412.579255] do_numa_page+0x205/0x5b0
> [ 412.579305] __handle_mm_fault+0x2b0/0x6c0
> [ 412.579354] handle_mm_fault+0x105/0x270
> [ 412.579404] do_user_addr_fault+0x214/0x6b0
> [ 412.579453] exc_page_fault+0x64/0x140
> [ 412.579509] asm_exc_page_fault+0x22/0x30
>
> 2583 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
> 2584                             int node)
> 2585 {
> ......
>
> 2628         if (nr_succeeded) {
> 2629                 count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> 2630                 if (!node_is_toptier(folio_nid(folio)) && node_is_toptier(node))
> 2631                         mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
> 2632                                             nr_succeeded);
> 2633         }
> 2634         BUG_ON(!list_empty(&migratepages));
> 2635         return isolated;
> 2636
> 2637 out:

Got it. Thanks.

>
> After changing as below, the system crash issue is gone.
>
> +++ b/mm/migrate.c
> @@ -1668,7 +1668,7 @@ static int migrate_pages_batch(struct list_head *from,
>  			 */
>  			if (nr_pages > 2 &&
>  			   !list_empty(&folio->_deferred_list)) {
> -				if (try_split_folio(folio, from) == 0) {
> +				if (try_split_folio(folio, split_folios) == 0) {
>  					stats->nr_thp_split += is_thp;
>  					stats->nr_split++;
>  					continue;

Let me resend with this fix.

--
Best Regards,
Yan, Zi
diff --git a/mm/migrate.c b/mm/migrate.c
index ab9856f5931b..6bd9319624a3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1652,6 +1652,29 @@ static int migrate_pages_batch(struct list_head *from,
 
 			cond_resched();
 
+			/*
+			 * The rare folio on the deferred split list should
+			 * be split now. It should not count as a failure.
+			 * Only check it without removing it from the list.
+			 * Since the folio can be on deferred_split_scan()
+			 * local list and removing it can cause the local list
+			 * corruption. Folio split process below can handle it
+			 * with the help of folio_ref_freeze().
+			 *
+			 * nr_pages > 2 is needed to avoid checking order-1
+			 * page cache folios. They exist, in contrast to
+			 * non-existent order-1 anonymous folios, and do not
+			 * use _deferred_list.
+			 */
+			if (nr_pages > 2 &&
+			   !list_empty(&folio->_deferred_list)) {
+				if (try_split_folio(folio, from) == 0) {
+					stats->nr_thp_split += is_thp;
+					stats->nr_split++;
+					continue;
+				}
+			}
+
 			/*
 			 * Large folio migration might be unsupported or
 			 * the allocation might be failed so we should retry