Message ID | 20230404120117.2562166-5-stevensd@google.com (mailing list archive) |
---|---|
State | New |
Series | mm/khugepaged: fixes for khugepaged+shmem |
On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that would the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
> mm/khugepaged.c | 79 ++++++++++++++++++-------------------------
> 1 file changed, 29 insertions(+), 50 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7679551e9540..a19aa140fd52 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>  *
>  * Basic scheme is simple, details are more complex:
>  * - allocate and lock a new huge page;
> - * - scan page cache replacing old pages with the new one
> + * - scan page cache, locking old pages
>  *   + swap/gup in pages if necessary;
> - *   + keep old pages around in case rollback is required;
> + * - copy data to new page
> + * - handle shmem holes
> + *   + re-validate that holes weren't filled by someone else
> + *   + check for userfaultfd

PS: some of the changes may belong to the previous patch here, but it's not
necessary to repost only for this, just in case there'll be a new one.

>  * - finalize updates to the page cache;
>  * - if replacing succeeds:
> - *   + copy data over;
> - *   + free old pages;
>  *   + unlock huge page;
> + *   + free old pages;
>  * - if replacing failed;
> - *   + put all pages back and unfreeze them;
> - *   + restore gaps in the page cache;
> + *   + unlock old pages
>  *   + unlock and free huge page;
>  */
>  static int collapse_file(struct mm_struct *mm, unsigned long addr,
> @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  	} while (1);
>
> -	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> -	 */
> -
>  	xas_set(&xas, start);
>  	for (index = start; index < end; index++) {
>  		page = xas_next(&xas);
> @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
>
>  		/*
> -		 * The page is expected to have page_count() == 3:
> +		 * We control three references to the page:
>  		 *  - we hold a pin on it;
>  		 *  - one reference from page cache;
>  		 *  - one from isolate_lru_page;
> +		 * If those are the only references, then any new usage of the
> +		 * page will have to fetch it from the page cache. That requires
> +		 * locking the page to handle truncate, so any new usage will be
> +		 * blocked until we unlock page after collapse/during rollback.
>  		 */
> -		if (!page_ref_freeze(page, 3)) {
> +		if (page_count(page) != 3) {
>  			result = SCAN_PAGE_COUNT;
>  			xas_unlock_irq(&xas);
>  			putback_lru_page(page);

Personally I don't see anything wrong with this change to resolve the
deadlock. E.g. a fast-gup race right before unmapping the pgtables seems
fine, since we'll just bail out with >3 refcounts (or fast-gup bails out by
checking pte changes). Either way looks fine here.

So far it looks good to me, but that may not mean much per the history of
what I can overlook. It'll always be good to hear from Hugh and others.
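[Editor's note: the SEEK_DATA/SEEK_HOLE symptom described in the commit message above can be observed from userspace with a simple probe. The sketch below is not part of the patch; the tmpfs path is hypothetical, and it assumes khugepaged (or MADV_COLLAPSE from another task) is concurrently collapsing the file's page cache while the loop runs.]

```c
/*
 * Hypothetical probe for the SEEK_DATA/SEEK_HOLE transient described
 * above: data that has been written to a shmem file should never be
 * reported as a hole. The file path and the assumption that a collapse
 * races with the loop are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/dev/shm/collapse-probe";	/* assumed tmpfs mount */
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0) { perror("open"); return 1; }

	char buf[4096];
	memset(buf, 'x', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
		perror("pwrite");
		return 1;
	}

	/* Offset 0 holds written data, so SEEK_DATA must report it. */
	for (int i = 0; i < 1000000; i++) {
		off_t off = lseek(fd, 0, SEEK_DATA);
		if (off != 0) {
			/* The race fixed here could transiently land us here. */
			fprintf(stderr, "data at 0 reported as hole/error: %ld\n",
				(long)off);
			return 1;
		}
	}
	close(fd);
	return 0;
}
```

On a kernel with the race, a concurrent collapse could make the lseek() transiently misreport the written page; with this patch it should always return offset 0.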
On Tue, 4 Apr 2023, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> >
> > The present pages are already locked, so replacing them with the locked
> > hpage before the collapse is finalized is unnecessary. However, it is
> > necessary to stop freezing the present pages after validating them,
> > since leaving long-term frozen pages in the page cache can lead to
> > deadlocks. Simply checking the reference count is sufficient to ensure
> > that there are no long-term references hanging around that would the
> > collapse would break. Similar to hpage, there is no reason that the
> > present pages actually need to be frozen in addition to being locked.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > an fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > a similar race where pages could temporarily disappear from mincore.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <stevensd@chromium.org>
...
>
> Personally I don't see anything wrong with this change to resolve the
> deadlock. E.g. a fast-gup race right before unmapping the pgtables seems
> fine, since we'll just bail out with >3 refcounts (or fast-gup bails out by
> checking pte changes). Either way looks fine here.
>
> So far it looks good to me, but that may not mean much per the history of
> what I can overlook. It'll always be good to hear from Hugh and others.

I'm uneasy about it, and haven't let it sink in for long enough: but I
haven't spotted anything wrong with it, nor experienced any trouble.

I would have much preferred David to stick with the current scheme, fix up
seek_hole_data, and be less concerned with the mincore transients: this
patch makes a significant change that is difficult to be sure of.

I was dubious about the unfrozen "page_count(page) != 3" check (where
another task can grab a reference an instant later), but perhaps it does
serve a purpose, since we hold the page lock there: it excludes concurrent
shmem reads, which grab but drop the page lock before copying (though it's
not clear that those actually need excluding).

I had thought shmem was peculiar in relying on the page lock while writing,
but turned out to be quite wrong about that: most filesystems rely on the
page lock while writing, though I'm not sure whether that's true of all
(and it doesn't matter while collapse of a non-shmem file is only permitted
on read-only files).

We shall see.

Hugh
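[Editor's note: for the "mincore transients" Hugh mentions, a similarly minimal and purely illustrative sketch is below. The file path, the single-page mapping, and the assumption that a collapse of the same file races with the loop are hypothetical, and on a loaded system reclaim could also clear the residency bit, so this is only a rough illustration of the symptom.]

```c
/*
 * Hypothetical mincore() probe: a page of a shmem file that was just
 * faulted in and dirtied should normally be reported as resident; the
 * transient fixed by this patch could briefly make it disappear while
 * a collapse of the same range is in flight.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/shm/collapse-probe", O_RDWR | O_CREAT, 0600);
	if (fd < 0) { perror("open"); return 1; }
	if (ftruncate(fd, psz)) { perror("ftruncate"); return 1; }

	char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }
	p[0] = 1;	/* fault the page in and dirty it */

	unsigned char vec;
	for (int i = 0; i < 1000000; i++) {
		if (mincore(p, psz, &vec)) { perror("mincore"); return 1; }
		if (!(vec & 1)) {
			/* The race fixed here could transiently land us here. */
			fprintf(stderr, "resident page reported not present\n");
			return 1;
		}
	}
	munmap(p, psz);
	close(fd);
	return 0;
}
```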
Hi,

On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
>
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
>
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that would the
> collapse would break. Similar to hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
>
> This fixes a race where folio_seek_hole_data would mistake hpage for
> an fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>

I noticed that recently MADV_COLLAPSE stopped being able to collapse a
binary's executable code, always failing with EAGAIN. I bisected it down to
a2e17cc2efc7 - this commit.

Using perf trace -e 'huge_memory:*' -a I see

  1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)

for every attempt at doing madvise(MADV_COLLAPSE).

I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
using huge pages for executable code that wasn't completely gross.

I don't yet have a standalone repro, but can write one if that's helpful.

Greetings,

Andres Freund
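[Editor's note: the kind of MADV_COLLAPSE use on executable text that Andres describes looks roughly like the sketch below. The linker-provided symbols, the 2 MiB huge page size, and the availability of file/shmem THP collapse in the running kernel are assumptions, not details taken from this thread.]

```c
/*
 * Rough sketch of collapsing a binary's own text with MADV_COLLAPSE
 * (available since Linux 6.1). Only the huge-page-aligned portion that
 * lies fully inside the text segment is collapsed.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the uapi headers; check your kernel */
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed PMD size on x86-64 */

/* Linker-provided symbols; their availability here is an assumption. */
extern char __executable_start[];
extern char etext[];

int main(void)
{
	uintptr_t start = ((uintptr_t)__executable_start + HPAGE_SIZE - 1) &
			  ~(HPAGE_SIZE - 1);
	uintptr_t end = (uintptr_t)etext & ~(HPAGE_SIZE - 1);

	if (end <= start) {
		fprintf(stderr, "text segment smaller than one huge page\n");
		return 1;
	}
	if (madvise((void *)start, end - start, MADV_COLLAPSE)) {
		perror("madvise(MADV_COLLAPSE)");	/* EAGAIN was the symptom reported here */
		return 1;
	}
	printf("collapsed %zu bytes of text\n", (size_t)(end - start));
	return 0;
}
```

With the regression described in this report, the madvise() call would keep failing with EAGAIN, matching the trace output above.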
On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote: > Hi, Hi, Andres, > > On 2023-04-04 21:01:17 +0900, David Stevens wrote: > > From: David Stevens <stevensd@chromium.org> > > > > Make sure that collapse_file doesn't interfere with checking the > > uptodate flag in the page cache by only inserting hpage into the page > > cache after it has been updated and marked uptodate. This is achieved by > > simply not replacing present pages with hpage when iterating over the > > target range. > > > > The present pages are already locked, so replacing them with the locked > > hpage before the collapse is finalized is unnecessary. However, it is > > necessary to stop freezing the present pages after validating them, > > since leaving long-term frozen pages in the page cache can lead to > > deadlocks. Simply checking the reference count is sufficient to ensure > > that there are no long-term references hanging around that would the > > collapse would break. Similar to hpage, there is no reason that the > > present pages actually need to be frozen in addition to being locked. > > > > This fixes a race where folio_seek_hole_data would mistake hpage for > > an fallocated but unwritten page. This race is visible to userspace via > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes > > a similar race where pages could temporarily disappear from mincore. > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > Signed-off-by: David Stevens <stevensd@chromium.org> > > I noticed that recently MADV_COLLAPSE stopped being able to collapse a > binary's executable code, always failing with EAGAIN. I bisected it down to > a2e17cc2efc7 - this commit. > > Using perf trace -e 'huge_memory:*' -a I see > > 1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > 1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > 1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > 1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17) > 1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > for every attempt at doing madvise(MADV_COLLAPSE). > > > I'm sad about that, because MADV_COLLAPSE was the first thing that allowed > using huge pages for executable code that wasn't entirely completely gross. > > > I don't yet have a standalone repro, but can write one if that's helpful. 
There's a fix:
https://lore.kernel.org/all/20230607053135.2087354-1-stevensd@google.com/
Already in today's Andrew's pull for rc7:
https://lore.kernel.org/all/20230620123828.813b1140d9c13af900e8edb3@linux-foundation.org/
Hi, On 2023-06-20 17:11:30 -0400, Peter Xu wrote: > On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote: > > On 2023-04-04 21:01:17 +0900, David Stevens wrote: > > > From: David Stevens <stevensd@chromium.org> > > > > > > Make sure that collapse_file doesn't interfere with checking the > > > uptodate flag in the page cache by only inserting hpage into the page > > > cache after it has been updated and marked uptodate. This is achieved by > > > simply not replacing present pages with hpage when iterating over the > > > target range. > > > > > > The present pages are already locked, so replacing them with the locked > > > hpage before the collapse is finalized is unnecessary. However, it is > > > necessary to stop freezing the present pages after validating them, > > > since leaving long-term frozen pages in the page cache can lead to > > > deadlocks. Simply checking the reference count is sufficient to ensure > > > that there are no long-term references hanging around that would the > > > collapse would break. Similar to hpage, there is no reason that the > > > present pages actually need to be frozen in addition to being locked. > > > > > > This fixes a race where folio_seek_hole_data would mistake hpage for > > > an fallocated but unwritten page. This race is visible to userspace via > > > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes > > > a similar race where pages could temporarily disappear from mincore. > > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") > > > Signed-off-by: David Stevens <stevensd@chromium.org> > > > > I noticed that recently MADV_COLLAPSE stopped being able to collapse a > > binary's executable code, always failing with EAGAIN. I bisected it down to > > a2e17cc2efc7 - this commit. > > > > Using perf trace -e 'huge_memory:*' -a I see > > > > 1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > 1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > 1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > 1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17) > > 1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17) > > > > for every attempt at doing madvise(MADV_COLLAPSE). > > > > > > I'm sad about that, because MADV_COLLAPSE was the first thing that allowed > > using huge pages for executable code that wasn't entirely completely gross. > > > > > > I don't yet have a standalone repro, but can write one if that's helpful. 
> > There's a fix:
> > https://lore.kernel.org/all/20230607053135.2087354-1-stevensd@google.com/
> > Already in today's Andrew's pull for rc7:
> > https://lore.kernel.org/all/20230620123828.813b1140d9c13af900e8edb3@linux-foundation.org/

Ah, great! I can confirm that the fix unbreaks our use of MADV_COLLAPSE for
executable code...

Greetings,

Andres Freund
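[Editor's note: one way to double-check from userspace that a collapse actually took effect is to read the PMD-mapping counters exposed per VMA in /proc/<pid>/smaps. The helper below is a rough illustration; summing over all mappings rather than locating the specific VMA is this sketch's own simplification.]

```c
/*
 * Hypothetical check: after a successful madvise(MADV_COLLAPSE) on a
 * file- or shmem-backed mapping, the corresponding smaps entries should
 * show non-zero FilePmdMapped or ShmemPmdMapped values. Parsing is
 * deliberately crude.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	if (!f) { perror("fopen"); return 1; }

	char line[256];
	long file_pmd = 0, shmem_pmd = 0, val;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "FilePmdMapped: %ld kB", &val) == 1)
			file_pmd += val;
		else if (sscanf(line, "ShmemPmdMapped: %ld kB", &val) == 1)
			shmem_pmd += val;
	}
	fclose(f);

	printf("FilePmdMapped: %ld kB, ShmemPmdMapped: %ld kB\n",
	       file_pmd, shmem_pmd);
	return 0;
}
```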
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 7679551e9540..a19aa140fd52 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, * * Basic scheme is simple, details are more complex: * - allocate and lock a new huge page; - * - scan page cache replacing old pages with the new one + * - scan page cache, locking old pages * + swap/gup in pages if necessary; - * + keep old pages around in case rollback is required; + * - copy data to new page + * - handle shmem holes + * + re-validate that holes weren't filled by someone else + * + check for userfaultfd * - finalize updates to the page cache; * - if replacing succeeds: - * + copy data over; - * + free old pages; * + unlock huge page; + * + free old pages; * - if replacing failed; - * + put all pages back and unfreeze them; - * + restore gaps in the page cache; + * + unlock old pages * + unlock and free huge page; */ static int collapse_file(struct mm_struct *mm, unsigned long addr, @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } } while (1); - /* - * At this point the hpage is locked and not up-to-date. - * It's safe to insert it into the page cache, because nobody would - * be able to map it or use it in another way until we unlock it. - */ - xas_set(&xas, start); for (index = start; index < end; index++) { page = xas_next(&xas); @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, VM_BUG_ON_PAGE(page != xas_load(&xas), page); /* - * The page is expected to have page_count() == 3: + * We control three references to the page: * - we hold a pin on it; * - one reference from page cache; * - one from isolate_lru_page; + * If those are the only references, then any new usage of the + * page will have to fetch it from the page cache. That requires + * locking the page to handle truncate, so any new usage will be + * blocked until we unlock page after collapse/during rollback. */ - if (!page_ref_freeze(page, 3)) { + if (page_count(page) != 3) { result = SCAN_PAGE_COUNT; xas_unlock_irq(&xas); putback_lru_page(page); @@ -2089,13 +2088,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } /* - * Add the page to the list to be able to undo the collapse if - * something go wrong. + * Accumulate the pages that are being collapsed. */ list_add_tail(&page->lru, &pagelist); - - /* Finally, replace with the new page. */ - xas_store(&xas, hpage); continue; out_unlock: unlock_page(page); @@ -2132,8 +2127,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, goto rollback; /* - * Replacing old pages with new one has succeeded, now we - * attempt to copy the contents. + * The old pages are locked, so they won't change anymore. */ index = start; list_for_each_entry(page, &pagelist, lru) { @@ -2222,11 +2216,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, /* nr_none is always 0 for non-shmem. */ __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none); } - /* Join all the small entries into a single multi-index entry. */ - xas_set_order(&xas, start, HPAGE_PMD_ORDER); - xas_store(&xas, hpage); - xas_unlock_irq(&xas); + /* + * Mark hpage as uptodate before inserting it into the page cache so + * that it isn't mistaken for an fallocated but unwritten page. 
+ */ folio = page_folio(hpage); folio_mark_uptodate(folio); folio_ref_add(folio, HPAGE_PMD_NR - 1); @@ -2235,6 +2229,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, folio_mark_dirty(folio); folio_add_lru(folio); + /* Join all the small entries into a single multi-index entry. */ + xas_set_order(&xas, start, HPAGE_PMD_ORDER); + xas_store(&xas, hpage); + xas_unlock_irq(&xas); + /* * Remove pte page tables, so we can re-fault the page as huge. */ @@ -2248,47 +2247,29 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, list_for_each_entry_safe(page, tmp, &pagelist, lru) { list_del(&page->lru); page->mapping = NULL; - page_ref_unfreeze(page, 1); ClearPageActive(page); ClearPageUnevictable(page); unlock_page(page); - put_page(page); + folio_put_refs(page_folio(page), 3); } goto out; rollback: /* Something went wrong: roll back page cache changes */ - xas_lock_irq(&xas); if (nr_none) { + xas_lock_irq(&xas); mapping->nrpages -= nr_none; shmem_uncharge(mapping->host, nr_none); + xas_unlock_irq(&xas); } - xas_set(&xas, start); - end = index; - for (index = start; index < end; index++) { - xas_next(&xas); - page = list_first_entry_or_null(&pagelist, - struct page, lru); - if (!page || xas.xa_index < page->index) { - nr_none--; - continue; - } - - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); - - /* Unfreeze the page. */ + list_for_each_entry_safe(page, tmp, &pagelist, lru) { list_del(&page->lru); - page_ref_unfreeze(page, 2); - xas_store(&xas, page); - xas_pause(&xas); - xas_unlock_irq(&xas); unlock_page(page); putback_lru_page(page); - xas_lock_irq(&xas); + put_page(page); } - VM_BUG_ON(nr_none); /* * Undo the updates of filemap_nr_thps_inc for non-SHMEM * file only. This undo is not needed unless failure is @@ -2303,8 +2284,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, smp_mb(); } - xas_unlock_irq(&xas); - hpage->mapping = NULL; unlock_page(hpage);