diff mbox series

[v17,03/14] shmem: Implement splice-read

Message ID 20230308165251.2078898-4-dhowells@redhat.com (mailing list archive)
State New
Headers show
Series splice, block: Use page pinning and kill ITER_PIPE | expand

Commit Message

David Howells March 8, 2023, 4:52 p.m. UTC
The new filemap_splice_read() has an implicit expectation via
filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't
fully populate the pagecache of the file it is reading from[1], potentially
leading to a jump to NULL if this doesn't exist.  shmem, however, (and by
extension, tmpfs, ramfs and rootfs), doesn't have ->read_folio(),

Work around this by equipping shmem with its own splice-read
implementation, based on filemap_splice_read(), but able to paste in
zero_page when there's a page missing.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Daniel Golle <daniel@makrotopia.org>
cc: Guenter Roeck <groeck7@gmail.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Hugh Dickins <hughd@google.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/Y+pdHFFTk1TTEBsO@makrotopia.org/ [1]
---
 mm/shmem.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 123 insertions(+), 1 deletion(-)

Comments

Linus Torvalds March 8, 2023, 10:39 p.m. UTC | #1
On Wed, Mar 8, 2023 at 8:53 AM David Howells <dhowells@redhat.com> wrote:
>
> The new filemap_splice_read() has an implicit expectation via
> filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't
> fully populate the pagecache of the file it is reading from[1], potentially
> leading to a jump to NULL if this doesn't exist.  shmem, however, (and by
> extension, tmpfs, ramfs and rootfs), doesn't have ->read_folio(),

This patch is the only one in your series that I went "Ugh, that's
really ugly" for.

Do we really want to basically duplicate all of filemap_splice_read()?

I get the feeling that the zeropage case just isn't so important that
we'd need to duplicate filemap_splice_read() just for that, and I
think that the code should either

 (a) just make a silly "read_folio()" for shmfs that just clears the page.

     Ugly but maybe simple and not horrid?

or

 (b) teach filemap_splice_read() that a NULL 'read_folio' function
means "use the zero page"

     That might not be splice() itself, but maybe in
filemap_get_pages() or something.

or

 (c) go even further, and teach read_folio() in general about file
holes, and allow *any* filesystem to read zeroes that way in general
without creating a folio for it.

in a perfect world, if done well I think shmem_file_read_iter() should
go away, and it could use generic_file_read_iter too.

I dunno. Maybe shm really is *so* special that this is the right way
to do things, but I did react quite negatively to this patch. So not a
complete NAK, but definitely a "do we _really_ have to do this?"

                       Linus
David Howells March 8, 2023, 11:42 p.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I get the feeling that the zeropage case just isn't so important that we'd
> need to duplicate filemap_splice_read() just for that, and I think that the
> code should either
> 
>  (a) just make a silly "read_folio()" for shmfs that just clears the page.
> 
>      Ugly but maybe simple and not horrid?

The problem with that is that once a page is in the pagecache attached to a
shmem file, we can't get rid of it without deleting or truncating the file...
At least, I think that the case.  For all other regular filesystems, a page
full of zeros can be flushed/discarded by the LRU.

shmem also has its own function for allocating folios in its pagecache, the
caller of ->read_folio() would probably have to use that.

>  (b) teach filemap_splice_read() that a NULL 'read_folio' function
> means "use the zero page"
> 
>      That might not be splice() itself, but maybe in
> filemap_get_pages() or something.

It would require some special handling in filemap_get_pages() - and/or
probably better filemap_splice_read() since, for shmem, it's only relevant to
splice.  An additional flag could be added to filemap_get_pages() to tell it
to stop at a hole in the pagecache rather than invoking readahead.
filemap_splice_read() would then need to examine the pagecache to work out how
big the hole is and insert the appropriate number of zeropages before calling
back into filemap_get_pages() again.  Possibly it could use SEEK_DATA.

> or
> 
>  (c) go even further, and teach read_folio() in general about file
> holes, and allow *any* filesystem to read zeroes that way in general
> without creating a folio for it.

Nice idea, but we'd need a way to store a "negative" marker (as opposed to
"unknown") in the pagecache for the filemap code to be able to use it.  This
sort of thing might become easier if xarray gets switched to a maple tree
implementation as that would better allow for caching of a known file hole of
arbitrary size with a single entry.

But for the moment, the filemap code would have to jump through a filesystem's
->readahead or ->read_folio vectors to work out if there's a hole there or not
- but in both cases it must already have allocated the pages it wants to
query.

David
Matthew Wilcox (Oracle) March 14, 2023, 4:42 p.m. UTC | #3
On Wed, Mar 08, 2023 at 02:39:00PM -0800, Linus Torvalds wrote:
> On Wed, Mar 8, 2023 at 8:53 AM David Howells <dhowells@redhat.com> wrote:
> >
> > The new filemap_splice_read() has an implicit expectation via
> > filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't
> > fully populate the pagecache of the file it is reading from[1], potentially
> > leading to a jump to NULL if this doesn't exist.  shmem, however, (and by
> > extension, tmpfs, ramfs and rootfs), doesn't have ->read_folio(),
> 
> This patch is the only one in your series that I went "Ugh, that's
> really ugly" for.
> 
> Do we really want to basically duplicate all of filemap_splice_read()?
> 
> I get the feeling that the zeropage case just isn't so important that
> we'd need to duplicate filemap_splice_read() just for that, and I
> think that the code should either
> 
>  (a) just make a silly "read_folio()" for shmfs that just clears the page.
> 
>      Ugly but maybe simple and not horrid?

The problem is that we might have swapped out the shmem folio.  So we
don't want to clear the page, but ask swap to fill the page.  The way
that currently works (see shmem_get_folio_gfp()) is to fetch the swap
entry from the page cache, allocate a new folio inside the shmem code,
then replace the swap entry with the new folio.

What I'd like to see is the generic code say "Ah, this is a shmem
inode, so it's special and the xa_value entry is swap information,
not workingset information, so I'll allocate the folio and restore
the folio->private swap information to let the shmem_read_folio
function do its job correctly".

Either that or we completely overhaul the shmem code to store the
location of its swapped data somewhere that's not the page cache.

>  (b) teach filemap_splice_read() that a NULL 'read_folio' function
> means "use the zero page"

Same problem as (a).

>  (c) go even further, and teach read_folio() in general about file
> holes, and allow *any* filesystem to read zeroes that way in general
> without creating a folio for it.

I've had thoughts along those lines in the past.  It's pretty major
surgery, I think.  At the moment, we allocate the pages and add them
to the page cache in a locked state before asking the filesystem to
populate them.  So the fs doesn't even have the file layout (eg the
get_block or iomap info) that would tell it where the holes are until
the page has already been allocated and inserted.  We could of course
free the page and replace it with a special 'THIS_IS_A_HOLE' entry.
It's just never seemed important enuogh to me to do this surgery.

> in a perfect world, if done well I think shmem_file_read_iter() should
> go away, and it could use generic_file_read_iter too.
> 
> I dunno. Maybe shm really is *so* special that this is the right way
> to do things, but I did react quite negatively to this patch. So not a
> complete NAK, but definitely a "do we _really_ have to do this?"

I'd really like to see shmem have a read_folio implementation.  I
don't know how much work it's going to be.
Linus Torvalds March 14, 2023, 6:02 p.m. UTC | #4
On Tue, Mar 14, 2023 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> The problem is that we might have swapped out the shmem folio.  So we
> don't want to clear the page, but ask swap to fill the page.

Doesn't shmem_swapin_folio() already basically do all that work?

The real oddity with shmem - compared to other filesystems - is that
the xarray has a value entry instead of being a real folio. And yes,
the current filemap code will then just ignore such entries as
"doesn't exist", and so the regular read iterators will all fail on
it.

But while filemap_get_read_batch() will stop at a value-folio, I feel
like filemap_create_folio() should be able to turn a value page into a
"real" page. Right now it already allocates said page, but then I
think filemap_add_folio() will return -EEXIST when said entry exists
as a value.

But *if* instead of -EEXIST we could just replace the value with the
(already locked) page, and have some sane way to pass that value
(which is the swap entry data) to readpage(), I think that should just
do it all.

Admittedly I really don't know this area very well, so I may be
*entirely* out to lunch.

But the whole "teach the filemap code to actually react to XA value
entries" would be how I'd solve the hole issue too. So I think there
are commonalities here.

             Linus
               Linus
David Howells March 14, 2023, 6:26 p.m. UTC | #5
Hi Linus,

Are you okay if we go with my current patch for the moment?

David
Linus Torvalds March 14, 2023, 7:07 p.m. UTC | #6
On Tue, Mar 14, 2023 at 11:26 AM David Howells <dhowells@redhat.com> wrote:
>
> Are you okay if we go with my current patch for the moment?

I  guess.

But please at least stop doing the

     get_page(buf->page);

on the zero-page (which includes using no-op .get and .put functions
in  zero_pipe_buf_ops().

Maybe we can do /dev/null some day and actually have a common case for those.

             Linus
Linus Torvalds March 14, 2023, 7:09 p.m. UTC | #7
On Tue, Mar 14, 2023 at 12:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Maybe we can do /dev/null some day and actually have a common case for those.

/dev/zero, I mean. We already do splice to /dev/null (and splicing
from /dev/null isn't interesting ;)

            Linus
Matthew Wilcox (Oracle) March 14, 2023, 8:08 p.m. UTC | #8
On Tue, Mar 14, 2023 at 11:02:40AM -0700, Linus Torvalds wrote:
> On Tue, Mar 14, 2023 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > The problem is that we might have swapped out the shmem folio.  So we
> > don't want to clear the page, but ask swap to fill the page.
> 
> Doesn't shmem_swapin_folio() already basically do all that work?
> 
> The real oddity with shmem - compared to other filesystems - is that
> the xarray has a value entry instead of being a real folio. And yes,
> the current filemap code will then just ignore such entries as
> "doesn't exist", and so the regular read iterators will all fail on
> it.
> 
> But while filemap_get_read_batch() will stop at a value-folio, I feel
> like filemap_create_folio() should be able to turn a value page into a
> "real" page. Right now it already allocates said page, but then I
> think filemap_add_folio() will return -EEXIST when said entry exists
> as a value.
> 
> But *if* instead of -EEXIST we could just replace the value with the
> (already locked) page, and have some sane way to pass that value
> (which is the swap entry data) to readpage(), I think that should just
> do it all.

This was basically what I had in mind:

I don't think this will handle a case like:

Alloc order-0 folio at index 4
Alloc order-0 folio at index 7
Swap out both folios
Alloc order-9 folio at indices 0-511

But I don't see where shmem currently handles that either.  Maybe it
falls back to order-0 folios instead of the crude BUG_ON I put in.
Hugh?

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 82c1262f396f..30f2502314de 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -114,12 +114,6 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
 struct folio *shmem_read_folio_gfp(struct address_space *mapping,
 		pgoff_t index, gfp_t gfp);
 
-static inline struct folio *shmem_read_folio(struct address_space *mapping,
-		pgoff_t index)
-{
-	return shmem_read_folio_gfp(mapping, index, mapping_gfp_mask(mapping));
-}
-
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 57c1b154fb5a..8e4f95c5b65a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -877,6 +877,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 					order, gfp);
 		xas_lock_irq(&xas);
 		xas_for_each_conflict(&xas, entry) {
+			if (old)
+				BUG_ON(shmem_mapping(mapping));
 			old = entry;
 			if (!xa_is_value(entry)) {
 				xas_set_err(&xas, -EEXIST);
@@ -885,7 +887,12 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 		}
 
 		if (old) {
-			if (shadowp)
+			if (shmem_mapping(mapping)) {
+				folio_set_swap_entry(folio,
+						radix_to_swp_entry(old));
+				folio_set_swapcache(folio);
+				folio_set_swapbacked(folio);
+			} else if (shadowp)
 				*shadowp = old;
 			/* entry may have been split before we acquired lock */
 			order = xa_get_order(xas.xa, xas.xa_index);
diff --git a/mm/shmem.c b/mm/shmem.c
index 8e60826e4246..ea75c7dcf5ec 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2059,6 +2059,18 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
 			mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
 }
 
+static int shmem_read_folio(struct file *file, struct folio *folio)
+{
+	if (folio_test_swapcache(folio)) {
+		swap_readpage(&folio->page, true, NULL);
+	} else {
+		folio_zero_segment(folio, 0, folio_size(folio));
+		folio_mark_uptodate(folio);
+		folio_unlock(folio);
+	}
+	return 0;
+}
+
 /*
  * This is like autoremove_wake_function, but it removes the wait queue
  * entry unconditionally - even if something else had already woken the
@@ -2396,7 +2408,8 @@ static int shmem_fadvise_willneed(struct address_space *mapping,
 	xa_for_each_range(&mapping->i_pages, index, folio, start, end) {
 		if (!xa_is_value(folio))
 			continue;
-		folio = shmem_read_folio(mapping, index);
+		folio = shmem_read_folio_gfp(mapping, index,
+						mapping_gfp_mask(mapping));
 		if (!IS_ERR(folio))
 			folio_put(folio);
 	}
@@ -4037,6 +4050,7 @@ static int shmem_error_remove_page(struct address_space *mapping,
 }
 
 const struct address_space_operations shmem_aops = {
+	.read_folio	= shmem_read_folio,
 	.writepage	= shmem_writepage,
 	.dirty_folio	= noop_dirty_folio,
 #ifdef CONFIG_TMPFS
David Howells March 14, 2023, 9:50 p.m. UTC | #9
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But please at least stop doing the
> 
>      get_page(buf->page);
> 
> on the zero-page (which includes using no-op .get and .put functions
> in  zero_pipe_buf_ops().

I'll make the attached change.  It seems to work.

David
---
diff --git a/mm/shmem.c b/mm/shmem.c
index 3cbec1d56112..d9b60ab556fe 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2719,6 +2719,17 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
+static bool zero_pipe_buf_get(struct pipe_inode_info *pipe,
+			      struct pipe_buffer *buf)
+{
+	return true;
+}
+
+static void zero_pipe_buf_release(struct pipe_inode_info *pipe,
+				  struct pipe_buffer *buf)
+{
+}
+
 static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
 				    struct pipe_buffer *buf)
 {
@@ -2726,9 +2737,9 @@ static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
 }
 
 static const struct pipe_buf_operations zero_pipe_buf_ops = {
-	.release	= generic_pipe_buf_release,
+	.release	= zero_pipe_buf_release,
 	.try_steal	= zero_pipe_buf_try_steal,
-	.get		= generic_pipe_buf_get,
+	.get		= zero_pipe_buf_get,
 };
 
 static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
diff mbox series

Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index 448f393d8ab2..3cbec1d56112 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2719,6 +2719,128 @@  static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
+static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
+				    struct pipe_buffer *buf)
+{
+	return false;
+}
+
+static const struct pipe_buf_operations zero_pipe_buf_ops = {
+	.release	= generic_pipe_buf_release,
+	.try_steal	= zero_pipe_buf_try_steal,
+	.get		= generic_pipe_buf_get,
+};
+
+static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
+					loff_t fpos, size_t size)
+{
+	size_t offset = fpos & ~PAGE_MASK;
+
+	size = min_t(size_t, size, PAGE_SIZE - offset);
+
+	if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct pipe_buffer *buf = pipe_head_buf(pipe);
+
+		*buf = (struct pipe_buffer) {
+			.ops	= &zero_pipe_buf_ops,
+			.page	= ZERO_PAGE(0),
+			.offset	= offset,
+			.len	= size,
+		};
+		get_page(buf->page);
+		pipe->head++;
+	}
+
+	return size;
+}
+
+static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
+				      struct pipe_inode_info *pipe,
+				      size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio = NULL;
+	size_t total_spliced = 0, used, npages, n, part;
+	loff_t isize;
+	int error = 0;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	do {
+		if (*ppos >= i_size_read(inode))
+			break;
+
+		error = shmem_get_folio(inode, *ppos / PAGE_SIZE, &folio, SGP_READ);
+		if (error) {
+			if (error == -EINVAL)
+				error = 0;
+			break;
+		}
+		if (folio) {
+			folio_unlock(folio);
+
+			if (folio_test_hwpoison(folio)) {
+				error = -EIO;
+				break;
+			}
+		}
+
+		/*
+		 * i_size must be checked after we know the pages are Uptodate.
+		 *
+		 * Checking i_size after the check allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file - this is desired though).
+		 */
+		isize = i_size_read(inode);
+		if (unlikely(*ppos >= isize))
+			break;
+		part = min_t(loff_t, isize - *ppos, len);
+
+		if (folio) {
+			/*
+			 * If users can be writing to this page using arbitrary
+			 * virtual addresses, take care about potential aliasing
+			 * before reading the page on the kernel side.
+			 */
+			if (mapping_writably_mapped(mapping))
+				flush_dcache_folio(folio);
+			folio_mark_accessed(folio);
+			/*
+			 * Ok, we have the page, and it's up-to-date, so we can
+			 * now splice it into the pipe.
+			 */
+			n = splice_folio_into_pipe(pipe, folio, *ppos, part);
+			folio_put(folio);
+			folio = NULL;
+		} else {
+			n = splice_zeropage_into_pipe(pipe, *ppos, len);
+		}
+
+		if (!n)
+			break;
+		len -= n;
+		total_spliced += n;
+		*ppos += n;
+		in->f_ra.prev_pos = *ppos;
+		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+			break;
+
+		cond_resched();
+	} while (len);
+
+	if (folio)
+		folio_put(folio);
+
+	file_accessed(in);
+	return total_spliced ? total_spliced : error;
+}
+
 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct address_space *mapping = file->f_mapping;
@@ -3938,7 +4060,7 @@  static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= generic_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= shmem_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif