diff mbox series

[2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()

Message ID 162876947840.3068428.12591293664586646085.stgit@warthog.procyon.org.uk (mailing list archive)
State New
Headers show
Series mm: Fix NFS swapfiles and use DIO read for swapfiles | expand

Commit Message

David Howells Aug. 12, 2021, 11:57 a.m. UTC
Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
the ->direct_IO() method on the filesystem rather then ->readpage().

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 mm/page_io.c |   73 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 67 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig Aug. 12, 2021, 12:21 p.m. UTC | #1
On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:
> Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
> the ->direct_IO() method on the filesystem rather then ->readpage().

->direct_IO is just a helper for ->read_iter and ->write_iter, so please
don't call it directly.  It actually is slowly on its way out, with at
at least all of the iomap implementations not using it, as well as various
other file systems.

> +	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
> +	if (!ki)
> +		return -ENOMEM;

for the synchronous case we could avoid this allocation and just use
arguments on stack.
David Howells Aug. 12, 2021, 12:57 p.m. UTC | #2
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:
> > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
> > the ->direct_IO() method on the filesystem rather then ->readpage().
> 
> ->direct_IO is just a helper for ->read_iter and ->write_iter, so please
> don't call it directly.  It actually is slowly on its way out, with at
> at least all of the iomap implementations not using it, as well as various
> other file systems.

[Note that __swap_writepage() uses ->direct_IO().]

Calling ->write_iter is probably a bad idea here.  Imagine that it goes
through, say, generic_file_write_iter(), then __generic_file_write_iter() and
then generic_file_direct_write().  It adds a number of delays into the system,
including:

	- Taking the inode lock
	- Removing file privs
	- Cranking mtime, ctime, file version
	  - Doing mnt_want_write
	  - Setting the inode dirty
	- Waiting on pages in the range that are being written 
	- Walking over the pagecache to invalidate the range
	- Redoing the invalidation (can't be skipped since page 0 is pinned)

that we might want to skip as they'll end up being done for every page swapped
out.

> > +	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
> > +	if (!ki)
> > +		return -ENOMEM;
> 
> for the synchronous case we could avoid this allocation and just use
> arguments on stack.

True.

David
Matthew Wilcox (Oracle) Aug. 12, 2021, 1 p.m. UTC | #3
On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:

I'm not quite sure why we need the refcount.

> +	refcount_set(&ki->ki_refcnt, 2);
> +	init_sync_kiocb(&ki->iocb, swap_file);
> +	ki->page = page;
> +	ki->iocb.ki_flags = IOCB_DIRECT | IOCB_SWAP;
> +	ki->iocb.ki_pos	= page_file_offset(page);
> +	ki->iocb.ki_filp = get_file(swap_file);
> +	if (!synchronous)
> +		ki->iocb.ki_complete = swapfile_read_complete;
> +
> +	iov_iter_bvec(&to, READ, &bv, 1, PAGE_SIZE);
> +	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to);

After submitting the IO here ...

> +	if (ret != -EIOCBQUEUED)
> +		swapfile_read_complete(&ki->iocb, ret, 0);

We only touch the 'ki' here ... if the caller didn't call read_complete

> +	swapfile_put_kiocb(ki);

Except for here, which is only touched in order to put the refcount.

So why can't swapfile_read_complete() do the work of freeing the ki?
David Howells Aug. 12, 2021, 1:23 p.m. UTC | #4
Matthew Wilcox <willy@infradead.org> wrote:

> After submitting the IO here ...
> 
> > +	if (ret != -EIOCBQUEUED)
> > +		swapfile_read_complete(&ki->iocb, ret, 0);
> 
> We only touch the 'ki' here ... if the caller didn't call read_complete
> 
> > +	swapfile_put_kiocb(ki);
> 
> Except for here, which is only touched in order to put the refcount.
> 
> So why can't swapfile_read_complete() do the work of freeing the ki?

When I was doing something similar for cachefiles, I couldn't get it to work
like that.  I'll have another look at that.

David
David Howells Aug. 12, 2021, 1:37 p.m. UTC | #5
David Howells <dhowells@redhat.com> wrote:

> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > After submitting the IO here ...
> > 
> > > +	if (ret != -EIOCBQUEUED)
> > > +		swapfile_read_complete(&ki->iocb, ret, 0);
> > 
> > We only touch the 'ki' here ... if the caller didn't call read_complete
> > 
> > > +	swapfile_put_kiocb(ki);
> > 
> > Except for here, which is only touched in order to put the refcount.
> > 
> > So why can't swapfile_read_complete() do the work of freeing the ki?
> 
> When I was doing something similar for cachefiles, I couldn't get it to work
> like that.  I'll have another look at that.

Ah, yes.  generic_file_direct_write() accesses in the kiocb *after* calling
->direct_IO(), so the kiocb *must not* go away until after
generic_file_direct_write() has returned.

David
Matthew Wilcox (Oracle) Aug. 12, 2021, 1:50 p.m. UTC | #6
On Thu, Aug 12, 2021 at 02:37:59PM +0100, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
> 
> > Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > > After submitting the IO here ...
> > > 
> > > > +	if (ret != -EIOCBQUEUED)
> > > > +		swapfile_read_complete(&ki->iocb, ret, 0);
> > > 
> > > We only touch the 'ki' here ... if the caller didn't call read_complete
> > > 
> > > > +	swapfile_put_kiocb(ki);
> > > 
> > > Except for here, which is only touched in order to put the refcount.
> > > 
> > > So why can't swapfile_read_complete() do the work of freeing the ki?
> > 
> > When I was doing something similar for cachefiles, I couldn't get it to work
> > like that.  I'll have another look at that.
> 
> Ah, yes.  generic_file_direct_write() accesses in the kiocb *after* calling
> ->direct_IO(), so the kiocb *must not* go away until after
> generic_file_direct_write() has returned.

This is a read, not a write ... but we don't care about ki_pos being
updated, so that store can be conditioned on IOCB_SWAP being clear.
Or instead of storing directly to ki_pos, we take a pointer to ki_pos
and then redirect that pointer somewhere harmless.
David Howells Aug. 12, 2021, 2:16 p.m. UTC | #7
Matthew Wilcox <willy@infradead.org> wrote:

> This is a read, not a write ... but we don't care about ki_pos being
> updated, so that store can be conditioned on IOCB_SWAP being clear.
> Or instead of storing directly to ki_pos, we take a pointer to ki_pos
> and then redirect that pointer somewhere harmless.

But see also ext4_dio_read_iter(), for example.  That touches ki_filp after
starting the op.

David
Matthew Wilcox (Oracle) Aug. 12, 2021, 3:39 p.m. UTC | #8
On Thu, Aug 12, 2021 at 01:57:05PM +0100, David Howells wrote:
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:
> > > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
> > > the ->direct_IO() method on the filesystem rather then ->readpage().
> > 
> > ->direct_IO is just a helper for ->read_iter and ->write_iter, so please
> > don't call it directly.  It actually is slowly on its way out, with at
> > at least all of the iomap implementations not using it, as well as various
> > other file systems.
> 
> [Note that __swap_writepage() uses ->direct_IO().]
> 
> Calling ->write_iter is probably a bad idea here.  Imagine that it goes
> through, say, generic_file_write_iter(), then __generic_file_write_iter() and
> then generic_file_direct_write().  It adds a number of delays into the system,
> including:
> 
> 	- Taking the inode lock
> 	- Removing file privs
> 	- Cranking mtime, ctime, file version
> 	  - Doing mnt_want_write
> 	  - Setting the inode dirty
> 	- Waiting on pages in the range that are being written 
> 	- Walking over the pagecache to invalidate the range
> 	- Redoing the invalidation (can't be skipped since page 0 is pinned)
> 
> that we might want to skip as they'll end up being done for every page swapped
> out.

I agree with David; we want something lower-level for swap to call into.
I'd suggest aops->swap_rw and an implementation might well look
something like:

static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
{
	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
}
Christoph Hellwig Aug. 12, 2021, 5:02 p.m. UTC | #9
On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> I agree with David; we want something lower-level for swap to call into.
> I'd suggest aops->swap_rw and an implementation might well look
> something like:
> 
> static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> {
> 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> }

Yes, that might make sense and would also replace the awkward IOCB_SWAP
flag for the write side.

For file systems like ext4 and xfs that have an in-memory block mapping
tree this would be way better than the current version and also support
swap on say multi-device file systems properly.  We'd just need to be
careful to read the extent information in at extent_activate time,
by doing xfs_iread_extents for XFS or the equivalents in other file
systems.
Darrick J. Wong Aug. 12, 2021, 5:48 p.m. UTC | #10
On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote:
> On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> > I agree with David; we want something lower-level for swap to call into.
> > I'd suggest aops->swap_rw and an implementation might well look
> > something like:
> > 
> > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> > }
> 
> Yes, that might make sense and would also replace the awkward IOCB_SWAP
> flag for the write side.
> 
> For file systems like ext4 and xfs that have an in-memory block mapping
> tree this would be way better than the current version and also support
> swap on say multi-device file systems properly.  We'd just need to be
> careful to read the extent information in at extent_activate time,
> by doing xfs_iread_extents for XFS or the equivalents in other file
> systems.

You'd still want to walk the extent map at activation time to reject
swapfiles with holes, shared extents, etc., right?

--D
Matthew Wilcox (Oracle) Aug. 12, 2021, 6:14 p.m. UTC | #11
On Thu, Aug 12, 2021 at 10:48:18AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> > > I agree with David; we want something lower-level for swap to call into.
> > > I'd suggest aops->swap_rw and an implementation might well look
> > > something like:
> > > 
> > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> > > {
> > > 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> > > }
> > 
> > Yes, that might make sense and would also replace the awkward IOCB_SWAP
> > flag for the write side.
> > 
> > For file systems like ext4 and xfs that have an in-memory block mapping
> > tree this would be way better than the current version and also support
> > swap on say multi-device file systems properly.  We'd just need to be
> > careful to read the extent information in at extent_activate time,
> > by doing xfs_iread_extents for XFS or the equivalents in other file
> > systems.
> 
> You'd still want to walk the extent map at activation time to reject
> swapfiles with holes, shared extents, etc., right?

Well ... this would actually allow the filesystem to break COWs and
allocate new blocks for holes.  Maybe you don't want to be doing that
in a low-memory situation though ;-)
Theodore Ts'o Aug. 12, 2021, 8:13 p.m. UTC | #12
On Thu, Aug 12, 2021 at 07:14:37PM +0100, Matthew Wilcox wrote:
> 
> Well ... this would actually allow the filesystem to break COWs and
> allocate new blocks for holes.  Maybe you don't want to be doing that
> in a low-memory situation though ;-)

I'm not sure the benefits are worth the costs.  You'd have to handle
ENOSPC errors, and it would require some kind of metadata journal
transaction, which could potentially block for any number of reasons
(not just due to memory allocations, but because you're waiting for a
journal commit to complete).  As you say, doing that in a low-memory
situation seems to be unneeded complexity.

					- Ted
Christoph Hellwig Aug. 13, 2021, 6:54 a.m. UTC | #13
On Thu, Aug 12, 2021 at 10:48:18AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> > > I agree with David; we want something lower-level for swap to call into.
> > > I'd suggest aops->swap_rw and an implementation might well look
> > > something like:
> > > 
> > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> > > {
> > > 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> > > }
> > 
> > Yes, that might make sense and would also replace the awkward IOCB_SWAP
> > flag for the write side.
> > 
> > For file systems like ext4 and xfs that have an in-memory block mapping
> > tree this would be way better than the current version and also support
> > swap on say multi-device file systems properly.  We'd just need to be
> > careful to read the extent information in at extent_activate time,
> > by doing xfs_iread_extents for XFS or the equivalents in other file
> > systems.
> 
> You'd still want to walk the extent map at activation time to reject
> swapfiles with holes, shared extents, etc., right?

Yes.  While direct I/O code could do allocation at swap I/O time that
probably is not a good idea due to the memory requirements.
diff mbox series

Patch

diff --git a/mm/page_io.c b/mm/page_io.c
index edb72bf624d2..108f864cea28 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -354,6 +354,72 @@  int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+struct swapfile_kiocb {
+	struct kiocb		iocb;
+	struct page		*page;
+	refcount_t		ki_refcnt;
+};
+
+static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
+{
+	if (refcount_dec_and_test(&ki->ki_refcnt)) {
+		fput(ki->iocb.ki_filp);
+		kfree(ki);
+	}
+}
+
+static void swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);
+	struct page *page = ki->page;
+
+	if (ret == PAGE_SIZE) {
+		count_vm_event(PSWPIN);
+		SetPageUptodate(page);
+	} else {
+		pr_err_ratelimited("Read error (%ld) on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	}
+
+	unlock_page(page);
+	swapfile_put_kiocb(ki);
+}
+
+static int swapfile_read(struct swap_info_struct *sis, struct page *page,
+			 bool synchronous)
+{
+	struct swapfile_kiocb *ki;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page = page,
+		.bv_len  = PAGE_SIZE,
+		.bv_offset = 0
+	};
+	struct iov_iter to;
+	int ret;
+
+	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
+	if (!ki)
+		return -ENOMEM;
+
+	refcount_set(&ki->ki_refcnt, 2);
+	init_sync_kiocb(&ki->iocb, swap_file);
+	ki->page = page;
+	ki->iocb.ki_flags = IOCB_DIRECT | IOCB_SWAP;
+	ki->iocb.ki_pos	= page_file_offset(page);
+	ki->iocb.ki_filp = get_file(swap_file);
+	if (!synchronous)
+		ki->iocb.ki_complete = swapfile_read_complete;
+
+	iov_iter_bvec(&to, READ, &bv, 1, PAGE_SIZE);
+	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to);
+
+	if (ret != -EIOCBQUEUED)
+		swapfile_read_complete(&ki->iocb, ret, 0);
+	swapfile_put_kiocb(ki);
+	return (ret > 0) ? 0 : ret;
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -381,12 +447,7 @@  int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
-			count_vm_event(PSWPIN);
+		ret = swapfile_read(sis, page, synchronous);
 		goto out;
 	}