Message ID | 20230404120117.2562166-5-stevensd@google.com (mailing list archive) |
---|---|
State | New |
Series | mm/khugepaged: fixes for khugepaged+shmem |
On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that would the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
> mm/khugepaged.c | 79 ++++++++++++++++++-------------------------
> 1 file changed, 29 insertions(+), 50 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7679551e9540..a19aa140fd52 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>  *
>  * Basic scheme is simple, details are more complex:
>  * - allocate and lock a new huge page;
> - * - scan page cache replacing old pages with the new one
> + * - scan page cache, locking old pages
>  *   + swap/gup in pages if necessary;
> - *   + keep old pages around in case rollback is required;
> + * - copy data to new page
> + * - handle shmem holes
> + *   + re-validate that holes weren't filled by someone else
> + *   + check for userfaultfd

PS: some of the changes may belong to the previous patch here, but it's not
necessary to repost only for this, just in case there'll be a new one.

>  * - finalize updates to the page cache;
>  * - if replacing succeeds:
> - *   + copy data over;
> - *   + free old pages;
>  *   + unlock huge page;
> + *   + free old pages;
>  * - if replacing failed;
> - *   + put all pages back and unfreeze them;
> - *   + restore gaps in the page cache;
> + *   + unlock old pages
>  *   + unlock and free huge page;
>  */
>  static int collapse_file(struct mm_struct *mm, unsigned long addr,
> @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  	} while (1);
>
> -	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> -	 */
> -
>  	xas_set(&xas, start);
>  	for (index = start; index < end; index++) {
>  		page = xas_next(&xas);
> @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
>
>  		/*
> -		 * The page is expected to have page_count() == 3:
> +		 * We control three references to the page:
>  		 *  - we hold a pin on it;
>  		 *  - one reference from page cache;
>  		 *  - one from isolate_lru_page;
> +		 * If those are the only references, then any new usage of the
> +		 * page will have to fetch it from the page cache. That requires
> +		 * locking the page to handle truncate, so any new usage will be
> +		 * blocked until we unlock page after collapse/during rollback.
>  		 */
> -		if (!page_ref_freeze(page, 3)) {
> +		if (page_count(page) != 3) {
>  			result = SCAN_PAGE_COUNT;
>  			xas_unlock_irq(&xas);
>  			putback_lru_page(page);

Personally I don't see anything wrong with this change to resolve the
deadlock. E.g. a fast-gup race right before unmapping the pgtables seems
fine, since we'll just bail out with >3 refcounts (or fast-gup bails out by
checking pte changes). Either way looks fine here.

So far it looks good to me, but that may not mean much per the history of
what I can overlook. It'll always be good to hear from Hugh and others.
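[Editor's note: the SEEK_DATA/SEEK_HOLE symptom described in the commit message above can be observed from userspace with a simple probe. The sketch below is not part of the patch; the tmpfs path is hypothetical, and it assumes khugepaged (or MADV_COLLAPSE from another task) is concurrently collapsing the file's page cache while the loop runs.]

```c
/*
 * Hypothetical probe for the SEEK_DATA/SEEK_HOLE transient described
 * above: data that has been written to a shmem file should never be
 * reported as a hole. The file path and the assumption that a collapse
 * races with the loop are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/dev/shm/collapse-probe";	/* assumed tmpfs mount */
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0) { perror("open"); return 1; }

	char buf[4096];
	memset(buf, 'x', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
		perror("pwrite");
		return 1;
	}

	/* Offset 0 holds written data, so SEEK_DATA must report it. */
	for (int i = 0; i < 1000000; i++) {
		off_t off = lseek(fd, 0, SEEK_DATA);
		if (off != 0) {
			/* The race fixed here could transiently land us here. */
			fprintf(stderr, "data at 0 reported as hole/error: %ld\n",
				(long)off);
			return 1;
		}
	}
	close(fd);
	return 0;
}
```

On a kernel with the race, a concurrent collapse could make the lseek() transiently misreport the written page; with this patch it should always return offset 0.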
On Tue, 4 Apr 2023, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> >
> > The present pages are already locked, so replacing them with the locked
> > hpage before the collapse is finalized is unnecessary. However, it is
> > necessary to stop freezing the present pages after validating them,
> > since leaving long-term frozen pages in the page cache can lead to
> > deadlocks. Simply checking the reference count is sufficient to ensure
> > that there are no long-term references hanging around that would the
> > collapse would break. Similar to hpage, there is no reason that the
> > present pages actually need to be frozen in addition to being locked.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > an fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > a similar race where pages could temporarily disappear from mincore.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <stevensd@chromium.org>
...
>
> Personally I don't see anything wrong with this change to resolve the
> deadlock. E.g. a fast-gup race right before unmapping the pgtables seems
> fine, since we'll just bail out with >3 refcounts (or fast-gup bails out by
> checking pte changes). Either way looks fine here.
>
> So far it looks good to me, but that may not mean much per the history of
> what I can overlook. It'll always be good to hear from Hugh and others.

I'm uneasy about it, and haven't let it sink in for long enough: but I
haven't spotted anything wrong with it, nor experienced any trouble.

I would have much preferred David to stick with the current scheme, fix up
seek_hole_data, and be less concerned with the mincore transients: this
patch makes a significant change that is difficult to be sure of.

I was dubious about the unfrozen "page_count(page) != 3" check (where
another task can grab a reference an instant later), but perhaps it does
serve a purpose, since we hold the page lock there: it excludes concurrent
shmem reads, which grab but drop the page lock before copying (though it's
not clear that those actually need excluding).

I had thought shmem was peculiar in relying on the page lock while writing,
but turned out to be quite wrong about that: most filesystems rely on the
page lock while writing, though I'm not sure whether that's true of all
(and it doesn't matter while collapse of a non-shmem file is only permitted
on read-only files).

We shall see.

Hugh
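[Editor's note: for the "mincore transients" Hugh mentions, a similarly minimal and purely illustrative sketch is below. The file path, the single-page mapping, and the assumption that a collapse of the same file races with the loop are hypothetical, and on a loaded system reclaim could also clear the residency bit, so this is only a rough illustration of the symptom.]

```c
/*
 * Hypothetical mincore() probe: a page of a shmem file that was just
 * faulted in and dirtied should normally be reported as resident; the
 * transient fixed by this patch could briefly make it disappear while
 * a collapse of the same range is in flight.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/shm/collapse-probe", O_RDWR | O_CREAT, 0600);
	if (fd < 0) { perror("open"); return 1; }
	if (ftruncate(fd, psz)) { perror("ftruncate"); return 1; }

	char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }
	p[0] = 1;	/* fault the page in and dirty it */

	unsigned char vec;
	for (int i = 0; i < 1000000; i++) {
		if (mincore(p, psz, &vec)) { perror("mincore"); return 1; }
		if (!(vec & 1)) {
			/* The race fixed here could transiently land us here. */
			fprintf(stderr, "resident page reported not present\n");
			return 1;
		}
	}
	munmap(p, psz);
	close(fd);
	return 0;
}
```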
Hi,

On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that would the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>

I noticed that recently MADV_COLLAPSE stopped being able to collapse a
binary's executable code, always failing with EAGAIN. I bisected it down to
a2e17cc2efc7 - this commit.

Using perf trace -e 'huge_memory:*' -a I see

  1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)

for every attempt at doing madvise(MADV_COLLAPSE).

I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
using huge pages for executable code that wasn't completely gross.

I don't yet have a standalone repro, but can write one if that's helpful.

Greetings,

Andres Freund
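[Editor's note: the kind of MADV_COLLAPSE use on executable text that Andres describes looks roughly like the sketch below. The linker-provided symbols, the 2 MiB huge page size, and the availability of file/shmem THP collapse in the running kernel are assumptions, not details taken from this thread.]

```c
/*
 * Rough sketch of collapsing a binary's own text with MADV_COLLAPSE
 * (available since Linux 6.1). Only the huge-page-aligned portion that
 * lies fully inside the text segment is collapsed.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the uapi headers; check your kernel */
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed PMD size on x86-64 */

/* Linker-provided symbols; their availability here is an assumption. */
extern char __executable_start[];
extern char etext[];

int main(void)
{
	uintptr_t start = ((uintptr_t)__executable_start + HPAGE_SIZE - 1) &
			  ~(HPAGE_SIZE - 1);
	uintptr_t end = (uintptr_t)etext & ~(HPAGE_SIZE - 1);

	if (end <= start) {
		fprintf(stderr, "text segment smaller than one huge page\n");
		return 1;
	}
	if (madvise((void *)start, end - start, MADV_COLLAPSE)) {
		perror("madvise(MADV_COLLAPSE)");	/* EAGAIN was the symptom reported here */
		return 1;
	}
	printf("collapsed %zu bytes of text\n", (size_t)(end - start));
	return 0;
}
```

With the regression described in this report, the madvise() call would keep failing with EAGAIN, matching the trace output above.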
On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote: > Hi, Hi, Andres, > > On 2023-04-04 21:01:17 +0900, David Stevens wrote: > > From: David Stevens <stevensd@chromium.org> > > > > Make sure that collapse_file doesn't interfere with checking the > > uptodate flag in the page cache by only inserting hpage into the page > > cache after it has been updated and marked uptodate. This is achieved by > > simply not replacing present pages with hpage when iterating over the > > target range. > > > > The present pages are already locked, so replacing them with the locked > > hpage before the collapse is finalized is unnecessary. However, it is > > necessary to stop freezing the present pages after validating them, > > since leaving long-term frozen pages in the page cache can lead to > > deadlocks. Simply checking the reference count is sufficient to ensure > > that there are no long-term references hanging around that would the > > collapse would break. Similar to hpage, there is no reason that the > > present pages actually need to be frozen in addition to being locked. > > > > This fixes a race where folio_seek_hole_data would mistake hpage for > > an fallocated but unwritten page. This race is visible to userspace via > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes > > a similar race where pages could temporarily disappear from mincore. > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Signed-off-by: David Stevens <stevensd@chromium.org> > > I noticed that recently MADV_COLLAPSE stopped being able to collapse a > binary's executable code, always failing with EAGAIN. I bisected it down to > a2e17cc2efc7 - this commit. > > Using perf trace -e 'huge_memory:*' -a I see > > 1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > 1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > 1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > 1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > for every attempt at doing madvise(MADV_COLLAPSE). > > > I'm sad about that, because MADV_COLLAPSE was the first thing that allowed > using huge pages for executable code that wasn't entirely completely gross. > > > I don't yet have a standalone repro, but can write one if that's helpful. 
There's a fix:
https://lore.kernel.org/all/20230607053135.2087354-1-stevensd@google.com/
Already in today's Andrew's pull for rc7:
https://lore.kernel.org/all/20230620123828.813b1140d9c13af900e8edb3@linux-foundation.org/
Hi, On 2023-06-20 17:11:30 -0400, Peter Xu wrote: > On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote: > > On 2023-04-04 21:01:17 +0900, David Stevens wrote: > > > From: David Stevens <stevensd@chromium.org> > > > > > > Make sure that collapse_file doesn't interfere with checking the > > > uptodate flag in the page cache by only inserting hpage into the page > > > cache after it has been updated and marked uptodate. This is achieved by > > > simply not replacing present pages with hpage when iterating over the > > > target range. > > > > > > The present pages are already locked, so replacing them with the locked > > > hpage before the collapse is finalized is unnecessary. However, it is > > > necessary to stop freezing the present pages after validating them, > > > since leaving long-term frozen pages in the page cache can lead to > > > deadlocks. Simply checking the reference count is sufficient to ensure > > > that there are no long-term references hanging around that would the > > > collapse would break. Similar to hpage, there is no reason that the > > > present pages actually need to be frozen in addition to being locked. > > > > > > This fixes a race where folio_seek_hole_data would mistake hpage for > > > an fallocated but unwritten page. This race is visible to userspace via > > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes > > > a similar race where pages could temporarily disappear from mincore. > > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > > Signed-off-by: David Stevens <stevensd@chromium.org> > > > > I noticed that recently MADV_COLLAPSE stopped being able to collapse a > > binary's executable code, always failing with EAGAIN. I bisected it down to > > a2e17cc2efc7 - this commit. > > > > Using perf trace -e 'huge_memory:*' -a I see > > > > 1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > 1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > 1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > 1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > > > for every attempt at doing madvise(MADV_COLLAPSE). > > > > > > I'm sad about that, because MADV_COLLAPSE was the first thing that allowed > > using huge pages for executable code that wasn't entirely completely gross. > > > > > > I don't yet have a standalone repro, but can write one if that's helpful. 
> > There's a fix:
> > https://lore.kernel.org/all/20230607053135.2087354-1-stevensd@google.com/
> > Already in today's Andrew's pull for rc7:
> > https://lore.kernel.org/all/20230620123828.813b1140d9c13af900e8edb3@linux-foundation.org/

Ah, great! I can confirm that the fix unbreaks our use of MADV_COLLAPSE for
executable code...

Greetings,

Andres Freund
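[Editor's note: one way to double-check from userspace that a collapse actually took effect is to read the PMD-mapping counters exposed per VMA in /proc/<pid>/smaps. The helper below is a rough illustration; summing over all mappings rather than locating the specific VMA is this sketch's own simplification.]

```c
/*
 * Hypothetical check: after a successful madvise(MADV_COLLAPSE) on a
 * file- or shmem-backed mapping, the corresponding smaps entries should
 * show non-zero FilePmdMapped or ShmemPmdMapped values. Parsing is
 * deliberately crude.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	if (!f) { perror("fopen"); return 1; }

	char line[256];
	long file_pmd = 0, shmem_pmd = 0, val;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "FilePmdMapped: %ld kB", &val) == 1)
			file_pmd += val;
		else if (sscanf(line, "ShmemPmdMapped: %ld kB", &val) == 1)
			shmem_pmd += val;
	}
	fclose(f);

	printf("FilePmdMapped: %ld kB, ShmemPmdMapped: %ld kB\n",
	       file_pmd, shmem_pmd);
	return 0;
}
```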
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 7679551e9540..a19aa140fd52 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, * * Basic scheme is simple, details are more complex: * - allocate and lock a new huge page; - * - scan page cache replacing old pages with the new one + * - scan page cache, locking old pages * + swap/gup in pages if necessary; - * + keep old pages around in case rollback is required; + * - copy data to new page + * - handle shmem holes + * + re-validate that holes weren't filled by someone else + * + check for userfaultfd * - finalize updates to the page cache; * - if replacing succeeds: - * + copy data over; - * + free old pages; * + unlock huge page; + * + free old pages; * - if replacing failed; - * + put all pages back and unfreeze them; - * + restore gaps in the page cache; + * + unlock old pages * + unlock and free huge page; */ static int collapse_file(struct mm_struct *mm, unsigned long addr, @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } } while (1); - /* - * At this point the hpage is locked and not up-to-date. - * It's safe to insert it into the page cache, because nobody would - * be able to map it or use it in another way until we unlock it. - */ - xas_set(&xas, start); for (index = start; index < end; index++) { page = xas_next(&xas); @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, VM_BUG_ON_PAGE(page != xas_load(&xas), page); /* - * The page is expected to have page_count() == 3: + * We control three references to the page: * - we hold a pin on it; * - one reference from page cache; * - one from isolate_lru_page; + * If those are the only references, then any new usage of the + * page will have to fetch it from the page cache. That requires + * locking the page to handle truncate, so any new usage will be + * blocked until we unlock page after collapse/during rollback. */ - if (!page_ref_freeze(page, 3)) { + if (page_count(page) != 3) { result = SCAN_PAGE_COUNT; xas_unlock_irq(&xas); putback_lru_page(page); @@ -2089,13 +2088,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } /* - * Add the page to the list to be able to undo the collapse if - * something go wrong. + * Accumulate the pages that are being collapsed. */ list_add_tail(&page->lru, &pagelist); - - /* Finally, replace with the new page. */ - xas_store(&xas, hpage); continue; out_unlock: unlock_page(page); @@ -2132,8 +2127,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, goto rollback; /* - * Replacing old pages with new one has succeeded, now we - * attempt to copy the contents. + * The old pages are locked, so they won't change anymore. */ index = start; list_for_each_entry(page, &pagelist, lru) { @@ -2222,11 +2216,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, /* nr_none is always 0 for non-shmem. */ __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none); } - /* Join all the small entries into a single multi-index entry. */ - xas_set_order(&xas, start, HPAGE_PMD_ORDER); - xas_store(&xas, hpage); - xas_unlock_irq(&xas); + /* + * Mark hpage as uptodate before inserting it into the page cache so + * that it isn't mistaken for an fallocated but unwritten page. 
+ */ folio = page_folio(hpage); folio_mark_uptodate(folio); folio_ref_add(folio, HPAGE_PMD_NR - 1); @@ -2235,6 +2229,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, folio_mark_dirty(folio); folio_add_lru(folio); + /* Join all the small entries into a single multi-index entry. */ + xas_set_order(&xas, start, HPAGE_PMD_ORDER); + xas_store(&xas, hpage); + xas_unlock_irq(&xas); + /* * Remove pte page tables, so we can re-fault the page as huge. */ @@ -2248,47 +2247,29 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, list_for_each_entry_safe(page, tmp, &pagelist, lru) { list_del(&page->lru); page->mapping = NULL; - page_ref_unfreeze(page, 1); ClearPageActive(page); ClearPageUnevictable(page); unlock_page(page); - put_page(page); + folio_put_refs(page_folio(page), 3); } goto out; rollback: /* Something went wrong: roll back page cache changes */ - xas_lock_irq(&xas); if (nr_none) { + xas_lock_irq(&xas); mapping->nrpages -= nr_none; shmem_uncharge(mapping->host, nr_none); + xas_unlock_irq(&xas); } - xas_set(&xas, start); - end = index; - for (index = start; index < end; index++) { - xas_next(&xas); - page = list_first_entry_or_null(&pagelist, - struct page, lru); - if (!page || xas.xa_index < page->index) { - nr_none--; - continue; - } - - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); - - /* Unfreeze the page. */ + list_for_each_entry_safe(page, tmp, &pagelist, lru) { list_del(&page->lru); - page_ref_unfreeze(page, 2); - xas_store(&xas, page); - xas_pause(&xas); - xas_unlock_irq(&xas); unlock_page(page); putback_lru_page(page); - xas_lock_irq(&xas); + put_page(page); } - VM_BUG_ON(nr_none); /* * Undo the updates of filemap_nr_thps_inc for non-SHMEM * file only. This undo is not needed unless failure is @@ -2303,8 +2284,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, smp_mb(); } - xas_unlock_irq(&xas); - hpage->mapping = NULL; unlock_page(hpage);