[mmotm] tmpfs: do not allocate pages on read

Message ID	f9c2f38f-5eb8-5d30-40fa-93e88b5fbc51@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Fri, 4 Mar 2022 21:09:01 -0800 (PST) From: Hugh Dickins <hughd@google.com> To: Andrew Morton <akpm@linux-foundation.org> cc: Mikulas Patocka <mpatocka@redhat.com>, Zdenek Kabelac <zkabelac@redhat.com>, Lukas Czerner <lczerner@redhat.com>, "Darrick J. Wong" <djwong@kernel.org>, Miklos Szeredi <miklos@szeredi.hu>, Christoph Hellwig <hch@lst.de>, Borislav Petkov <bp@suse.de>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH mmotm] tmpfs: do not allocate pages on read Message-ID: <f9c2f38f-5eb8-5d30-40fa-93e88b5fbc51@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	[mmotm] tmpfs: do not allocate pages on read \| expand [mmotm] tmpfs: do not allocate pages on read

Message ID

f9c2f38f-5eb8-5d30-40fa-93e88b5fbc51@google.com (mailing list archive)

State

New, archived

Headers

Date: Fri, 4 Mar 2022 21:09:01 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
cc: Mikulas Patocka <mpatocka@redhat.com>,
        Zdenek Kabelac <zkabelac@redhat.com>,
        Lukas Czerner <lczerner@redhat.com>,
        "Darrick J. Wong" <djwong@kernel.org>,
        Miklos Szeredi <miklos@szeredi.hu>,
        Christoph Hellwig <hch@lst.de>, Borislav Petkov <bp@suse.de>,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org
Subject: [PATCH mmotm] tmpfs: do not allocate pages on read
Message-ID: <f9c2f38f-5eb8-5d30-40fa-93e88b5fbc51@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk

Series

[mmotm] tmpfs: do not allocate pages on read | expand

Commit Message

Hugh Dickins March 5, 2022, 5:09 a.m. UTC

Mikulas asked in
https://lore.kernel.org/linux-mm/alpine.LRH.2.02.2007210510230.6959@file01.intranet.prod.int.rdu2.redhat.com/
Do we still need a0ee5ec520ed ("tmpfs: allocate on read when stacked")?

Lukas noticed this unusual behavior of loop device backed by tmpfs in
https://lore.kernel.org/linux-mm/20211126075100.gd64odg2bcptiqeb@work/

Normally, shmem_file_read_iter() copies the ZERO_PAGE when reading holes;
but if it looks like it might be a read for "a stacking filesystem", it
allocates actual pages to the page cache, and even marks them as dirty.
And reads from the loop device do satisfy the test that is used.

This oddity was added for an old version of unionfs, to help to limit
its usage to the limited size of the tmpfs mount involved; but about
the same time as the tmpfs mod went in (2.6.25), unionfs was reworked
to proceed differently; and the mod kept just in case others needed it.

Do we still need it? I cannot answer with more certainty than "Probably
not". It's nasty enough that we really should try to delete it; but if
a regression is reported somewhere, then we might have to revert later.

It's not quite as simple as just removing the test (as Mikulas did):
xfstests generic/013 hung because splice from tmpfs failed on page not
up-to-date and page mapping unset.  That can be fixed just by marking
the ZERO_PAGE as Uptodate, which of course it is; doing so here in
shmem_file_read_iter() is distasteful, but seems to be the best way.

My intention, though, was to stop using the ZERO_PAGE here altogether:
surely iov_iter_zero() is better for this case?  Sadly not: it relies
on clear_user(), and the x86 clear_user() is slower than its copy_user():
https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/

But while we are still using the ZERO_PAGE, let's stop dirtying its
struct page cacheline with unnecessary get_page() and put_page().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/shmem.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

Comments

Christoph Hellwig March 6, 2022, 9:27 a.m. UTC | #1

On Fri, Mar 04, 2022 at 09:09:01PM -0800, Hugh Dickins wrote:
> It's not quite as simple as just removing the test (as Mikulas did):
> xfstests generic/013 hung because splice from tmpfs failed on page not
> up-to-date and page mapping unset.  That can be fixed just by marking
> the ZERO_PAGE as Uptodate, which of course it is; doing so here in
> shmem_file_read_iter() is distasteful, but seems to be the best way.

Shouldn't we set ZERO_PAGE uptodate during early init code as it, uh,
is per definition uptodate all the time?

> 
> My intention, though, was to stop using the ZERO_PAGE here altogether:
> surely iov_iter_zero() is better for this case?  Sadly not: it relies
> on clear_user(), and the x86 clear_user() is slower than its copy_user():
> https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/

Oh, that's sad as just using clear_user would be the right thing to
here.

> But while we are still using the ZERO_PAGE, let's stop dirtying its
> struct page cacheline with unnecessary get_page() and put_page().
> 
> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> Reported-by: Lukas Czerner <lczerner@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>

But except for maybe making sure that ZERO_PAGE is always marked
uptodate this does looks good to me.

Hugh Dickins March 6, 2022, 10:56 p.m. UTC | #2

On Sun, 6 Mar 2022, Christoph Hellwig wrote:
> On Fri, Mar 04, 2022 at 09:09:01PM -0800, Hugh Dickins wrote:
> > It's not quite as simple as just removing the test (as Mikulas did):
> > xfstests generic/013 hung because splice from tmpfs failed on page not
> > up-to-date and page mapping unset.  That can be fixed just by marking
> > the ZERO_PAGE as Uptodate, which of course it is; doing so here in
> > shmem_file_read_iter() is distasteful, but seems to be the best way.
> 
> Shouldn't we set ZERO_PAGE uptodate during early init code as it, uh,
> is per definition uptodate all the time?

You're right, that does look hacky there.  I was too unsure of when and
how the different architectures set up ZERO_PAGE, so kept away.  But
looking through, pagecache_init() seems late enough in initialization
and early enough in running, and an appropriate place to do it -
tmpfs may be the first to need it, but it could be useful to others.
Just on ZERO_PAGE(0), the one used all over: never mind the other
colours of zero page, on those architectures which have multiple.

v2 coming up.

> 
> > 
> > My intention, though, was to stop using the ZERO_PAGE here altogether:
> > surely iov_iter_zero() is better for this case?  Sadly not: it relies
> > on clear_user(), and the x86 clear_user() is slower than its copy_user():
> > https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/
> 
> Oh, that's sad as just using clear_user would be the right thing to
> here.
> 
> > But while we are still using the ZERO_PAGE, let's stop dirtying its
> > struct page cacheline with unnecessary get_page() and put_page().
> > 
> > Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> > Reported-by: Lukas Czerner <lczerner@redhat.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> But except for maybe making sure that ZERO_PAGE is always marked
> uptodate this does looks good to me.

Thanks a lot for looking through.

Hugh

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2501,19 +2501,10 @@  static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct address_space *mapping = inode->i_mapping;
 	pgoff_t index;
 	unsigned long offset;
-	enum sgp_type sgp = SGP_READ;
 	int error = 0;
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
-	/*
-	 * Might this read be for a stacking filesystem?  Then when reading
-	 * holes of a sparse file, we actually need to allocate those pages,
-	 * and even mark them dirty, so it cannot exceed the max_blocks limit.
-	 */
-	if (!iter_is_iovec(to))
-		sgp = SGP_CACHE;
-
 	index = *ppos >> PAGE_SHIFT;
 	offset = *ppos & ~PAGE_MASK;
 
@@ -2522,6 +2513,7 @@  static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		pgoff_t end_index;
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
+		bool got_page;
 
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
@@ -2532,15 +2524,13 @@  static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 				break;
 		}
 
-		error = shmem_getpage(inode, index, &page, sgp);
+		error = shmem_getpage(inode, index, &page, SGP_READ);
 		if (error) {
 			if (error == -EINVAL)
 				error = 0;
 			break;
 		}
 		if (page) {
-			if (sgp == SGP_CACHE)
-				set_page_dirty(page);
 			unlock_page(page);
 
 			if (PageHWPoison(page)) {
@@ -2580,9 +2570,15 @@  static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (!offset)
 				mark_page_accessed(page);
+			got_page = true;
 		} else {
 			page = ZERO_PAGE(0);
-			get_page(page);
+			/*
+			 * Let splice page_cache_pipe_buf_confirm() succeed.
+			 */
+			if (!PageUptodate(page))
+				SetPageUptodate(page);
+			got_page = false;
 		}
 
 		/*
@@ -2595,7 +2591,8 @@  static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
 
-		put_page(page);
+		if (got_page)
+			put_page(page);
 		if (!iov_iter_count(to))
 			break;
 		if (ret < nr) {

[mmotm] tmpfs: do not allocate pages on read

Commit Message

Comments

Patch