Message ID | 20231214045841.961776-1-willy@infradead.org (mailing list archive)
---|---
State | New
Series | mm: Migrate high-order folios in swap cache correctly
On Thu, 14 Dec 2023 04:58:41 +0000 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> From: Charan Teja Kalla <quic_charante@quicinc.com>
>
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated. The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
>
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future. This can also manifest as infinite loops with the
> RCU read lock held.
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I'm thinking

Fixes: 3417013e0d183be ("mm/migrate: Add folio_migrate_mapping()")

hence

Cc: <stable@vger.kernel.org>

And also

Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
Hi Matthew,

It seems that the issue is not completely fixed by the change below. This
time it is because ->private is not filled in for the subpages of a THP,
which can result in the same type of issue this patch tries to fix:

1) On reclaim of an anon THP, the page is first added to the swap cache.
As part of this, contiguous swap space of THP size is allocated, the
folio_nr_pages swap cache indices are filled with the folio, and each
subpage's ->private is set to its respective swap entry.

add_to_swap_cache():
	for (i = 0; i < nr; i++) {
		set_page_private(folio_page(folio, i), entry.val + i);
		xas_store(&xas, folio);
		xas_next(&xas);
	}

2) Migrating a folio that is sitting in the swap cache sets only the head
page's newfolio->private to the swap entry. The tail pages are missed.

folio_migrate_mapping():
	newfolio->private = folio_get_private(folio);
	for (i = 0; i < entries; i++) {
		xas_store(&xas, newfolio);
		xas_next(&xas);
	}

3) When this migrated page, still sitting in the swap cache, is migrated
again, the THP can be split (see migrate_pages()->try_split_thp), which
ends up storing the subpages in the respective swap cache indices, but
the subpages' ->private does not contain valid swap entries.

__split_huge_page():
	for (i = nr - 1; i >= 1; i--) {
		if {
			...
		} else if (swap_cache) {
			__xa_store(&swap_cache->i_pages, offset + i,
					head + i, 0);
		}
	}

This leads to a state where all the subpages sit in the swap cache with
->private not holding a valid value. When delete_from_swap_cache() is
tried on a subpage, it replaces the __wrong swap cache index__ with a
NULL/shadow value and subsequently decreases the refcount of the subpage.

	for (i = 0; i < nr; i++) {
		if (subpage == page)
			continue;

		free_page_and_swap_cache(subpage):
			free_swap_cache(page);
			(calls delete_from_swap_cache())
			put_page(page);
	}

Consider a folio just sitting in the swap cache: its subpages will now
have a refcount of 1 (decremented by the swap cache deletion + put_page)
down from 3 (isolate + swap cache + lru_add_page_tail()). Now
migrate_pages() is tried again on these split THP pages, but the refcount
of 1 makes the pages be freed directly (unmap_and_move).

So this leads to a state of "freed page entry in the swap cache", which
can cause various corruptions, loops under the RCU read lock (observed in
mapping_get_entry), etc.

It seems the change below is also required here.

diff --git a/mm/migrate.c b/mm/migrate.c
index 9f5f52d..8049f4e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
 	folio_ref_add(newfolio, nr); /* add cache reference */
 	if (folio_test_swapbacked(folio)) {
 		__folio_set_swapbacked(newfolio);
-		if (folio_test_swapcache(folio)) {
+		if (folio_test_swapcache(folio))
 			folio_set_swapcache(newfolio);
-			newfolio->private = folio_get_private(folio);
-		}
 		entries = nr;
 	} else {
 		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,
 
 	/* Swap cache still stores N entries instead of a high-order entry */
 	for (i = 0; i < entries; i++) {
+		set_page_private(folio_page(newfolio, i),
+				folio_page(folio, i)->private);
 		xas_store(&xas, newfolio);
 		xas_next(&xas);
 	}

On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
> From: Charan Teja Kalla <quic_charante@quicinc.com>
>
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated. The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
>
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future. This can also manifest as infinite loops with the
> RCU read lock held.
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/migrate.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d9d2b9432e81..2d67ca47d2e2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>  	int dirty;
>  	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
>  	long nr = folio_nr_pages(folio);
> +	long entries, i;
>  
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
> @@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>  			folio_set_swapcache(newfolio);
>  			newfolio->private = folio_get_private(folio);
>  		}
> +		entries = nr;
>  	} else {
>  		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> +		entries = 1;
>  	}
>  
>  	/* Move dirty while page refs frozen and newpage not yet exposed */
> @@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		folio_set_dirty(newfolio);
>  	}
>  
> -	xas_store(&xas, newfolio);
> +	/* Swap cache still stores N entries instead of a high-order entry */
> +	for (i = 0; i < entries; i++) {
> +		xas_store(&xas, newfolio);
> +		xas_next(&xas);
> +	}
>  
>  	/*
>  	 * Drop cache reference from old page by unfreezing
I am sorry that I didn't check the latest kernel before posting. We are
working on the 6.1 LTS kernel, where the mentioned issue is observed. I
assume the analysis still holds true on the 6.1 kernel.

I see that this is not a problem on the latest kernels because of [1],
by David.

[1] https://lore.kernel.org/linux-mm/20230821160849.531668-2-david@redhat.com/#Z31mm:huge_memory.c

Thanks,
Charan

On 2/6/2024 8:24 PM, Charan Teja Kalla wrote:
> Hi Matthew,
>
> It seems that the issue is not completely fixed by the change below.
> This time it is because ->private is not filled in for the subpages of
> a THP, which can result in the same type of issue this patch tries to
> fix:
>
> 1) On reclaim of an anon THP, the page is first added to the swap
> cache. As part of this, contiguous swap space of THP size is allocated,
> the folio_nr_pages swap cache indices are filled with the folio, and
> each subpage's ->private is set to its respective swap entry.
> add_to_swap_cache():
> 	for (i = 0; i < nr; i++) {
> 		set_page_private(folio_page(folio, i), entry.val + i);
> 		xas_store(&xas, folio);
> 		xas_next(&xas);
> 	}
>
> 2) Migrating a folio that is sitting in the swap cache sets only the
> head page's newfolio->private to the swap entry. The tail pages are
> missed.
> folio_migrate_mapping():
> 	newfolio->private = folio_get_private(folio);
> 	for (i = 0; i < entries; i++) {
> 		xas_store(&xas, newfolio);
> 		xas_next(&xas);
> 	}
>
> 3) When this migrated page, still sitting in the swap cache, is
> migrated again, the THP can be split (see
> migrate_pages()->try_split_thp), which ends up storing the subpages in
> the respective swap cache indices, but the subpages' ->private does not
> contain valid swap entries.
> __split_huge_page():
> 	for (i = nr - 1; i >= 1; i--) {
> 		if {
> 			...
> 		} else if (swap_cache) {
> 			__xa_store(&swap_cache->i_pages, offset + i,
> 					head + i, 0);
> 		}
> 	}
>
> This leads to a state where all the subpages sit in the swap cache with
> ->private not holding a valid value. When delete_from_swap_cache() is
> tried on a subpage, it replaces the __wrong swap cache index__ with a
> NULL/shadow value and subsequently decreases the refcount of the
> subpage.
> 	for (i = 0; i < nr; i++) {
> 		if (subpage == page)
> 			continue;
>
> 		free_page_and_swap_cache(subpage):
> 			free_swap_cache(page);
> 			(calls delete_from_swap_cache())
> 			put_page(page);
> 	}
>
> Consider a folio just sitting in the swap cache: its subpages will now
> have a refcount of 1 (decremented by the swap cache deletion +
> put_page) down from 3 (isolate + swap cache + lru_add_page_tail()). Now
> migrate_pages() is tried again on these split THP pages, but the
> refcount of 1 makes the pages be freed directly (unmap_and_move).
>
> So this leads to a state of "freed page entry in the swap cache", which
> can cause various corruptions, loops under the RCU read lock (observed
> in mapping_get_entry), etc.
>
> It seems the change below is also required here.
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9f5f52d..8049f4e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
> 	folio_ref_add(newfolio, nr); /* add cache reference */
> 	if (folio_test_swapbacked(folio)) {
> 		__folio_set_swapbacked(newfolio);
> -		if (folio_test_swapcache(folio)) {
> +		if (folio_test_swapcache(folio))
> 			folio_set_swapcache(newfolio);
> -			newfolio->private = folio_get_private(folio);
> -		}
> 		entries = nr;
> 	} else {
> 		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> @@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,
>
> 	/* Swap cache still stores N entries instead of a high-order entry */
> 	for (i = 0; i < entries; i++) {
> +		set_page_private(folio_page(newfolio, i),
> +				folio_page(folio, i)->private);

I see that page_tail->private is expected to be zero in
__split_huge_page_tail(), so this patch is also wrong :(.

> 		xas_store(&xas, newfolio);
> 		xas_next(&xas);
> 	}
>
> On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
>> From: Charan Teja Kalla <quic_charante@quicinc.com>
>>
>> Large folios occupy N consecutive entries in the swap cache
>> instead of using multi-index entries like the page cache.
>> However, if a large folio is re-added to the LRU list, it can
>> be migrated. The migration code was not aware of the difference
>> between the swap cache and the page cache and assumed that a single
>> xas_store() would be sufficient.
>>
>> This leaves potentially many stale pointers to the now-migrated folio
>> in the swap cache, which can lead to almost arbitrary data corruption
>> in the future. This can also manifest as infinite loops with the
>> RCU read lock held.
>>
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
>> [modifications to the changelog & tweaked the fix]
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>  mm/migrate.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index d9d2b9432e81..2d67ca47d2e2 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>>  	int dirty;
>>  	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
>>  	long nr = folio_nr_pages(folio);
>> +	long entries, i;
>>  
>>  	if (!mapping) {
>>  		/* Anonymous page without mapping */
>> @@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>>  			folio_set_swapcache(newfolio);
>>  			newfolio->private = folio_get_private(folio);
>>  		}
>> +		entries = nr;
>>  	} else {
>>  		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
>> +		entries = 1;
>>  	}
>>  
>>  	/* Move dirty while page refs frozen and newpage not yet exposed */
>> @@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
>>  		folio_set_dirty(newfolio);
>>  	}
>>  
>> -	xas_store(&xas, newfolio);
>> +	/* Swap cache still stores N entries instead of a high-order entry */
>> +	for (i = 0; i < entries; i++) {
>> +		xas_store(&xas, newfolio);
>> +		xas_next(&xas);
>> +	}
>>  
>>  	/*
>>  	 * Drop cache reference from old page by unfreezing
diff --git a/mm/migrate.c b/mm/migrate.c
index d9d2b9432e81..2d67ca47d2e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
 	int dirty;
 	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
 	long nr = folio_nr_pages(folio);
+	long entries, i;
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
 			folio_set_swapcache(newfolio);
 			newfolio->private = folio_get_private(folio);
 		}
+		entries = nr;
 	} else {
 		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
+		entries = 1;
 	}
 
 	/* Move dirty while page refs frozen and newpage not yet exposed */
@@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
 		folio_set_dirty(newfolio);
 	}
 
-	xas_store(&xas, newfolio);
+	/* Swap cache still stores N entries instead of a high-order entry */
+	for (i = 0; i < entries; i++) {
+		xas_store(&xas, newfolio);
+		xas_next(&xas);
+	}
 
 	/*
 	 * Drop cache reference from old page by unfreezing