Message ID | 162876947840.3068428.12591293664586646085.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: Fix NFS swapfiles and use DIO read for swapfiles | expand |
On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote: > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use > the ->direct_IO() method on the filesystem rather then ->readpage(). ->direct_IO is just a helper for ->read_iter and ->write_iter, so please don't call it directly. It actually is slowly on its way out, with at at least all of the iomap implementations not using it, as well as various other file systems. > + ki = kzalloc(sizeof(*ki), GFP_KERNEL); > + if (!ki) > + return -ENOMEM; for the synchronous case we could avoid this allocation and just use arguments on stack.
Christoph Hellwig <hch@lst.de> wrote: > On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote: > > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use > > the ->direct_IO() method on the filesystem rather then ->readpage(). > > ->direct_IO is just a helper for ->read_iter and ->write_iter, so please > don't call it directly. It actually is slowly on its way out, with at > at least all of the iomap implementations not using it, as well as various > other file systems. [Note that __swap_writepage() uses ->direct_IO().] Calling ->write_iter is probably a bad idea here. Imagine that it goes through, say, generic_file_write_iter(), then __generic_file_write_iter() and then generic_file_direct_write(). It adds a number of delays into the system, including: - Taking the inode lock - Removing file privs - Cranking mtime, ctime, file version - Doing mnt_want_write - Setting the inode dirty - Waiting on pages in the range that are being written - Walking over the pagecache to invalidate the range - Redoing the invalidation (can't be skipped since page 0 is pinned) that we might want to skip as they'll end up being done for every page swapped out. > > + ki = kzalloc(sizeof(*ki), GFP_KERNEL); > > + if (!ki) > > + return -ENOMEM; > > for the synchronous case we could avoid this allocation and just use > arguments on stack. True. David
On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote: I'm not quite sure why we need the refcount. > + refcount_set(&ki->ki_refcnt, 2); > + init_sync_kiocb(&ki->iocb, swap_file); > + ki->page = page; > + ki->iocb.ki_flags = IOCB_DIRECT | IOCB_SWAP; > + ki->iocb.ki_pos = page_file_offset(page); > + ki->iocb.ki_filp = get_file(swap_file); > + if (!synchronous) > + ki->iocb.ki_complete = swapfile_read_complete; > + > + iov_iter_bvec(&to, READ, &bv, 1, PAGE_SIZE); > + ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to); After submitting the IO here ... > + if (ret != -EIOCBQUEUED) > + swapfile_read_complete(&ki->iocb, ret, 0); We only touch the 'ki' here ... if the caller didn't call read_complete > + swapfile_put_kiocb(ki); Except for here, which is only touched in order to put the refcount. So why can't swapfile_read_complete() do the work of freeing the ki?
Matthew Wilcox <willy@infradead.org> wrote: > After submitting the IO here ... > > > + if (ret != -EIOCBQUEUED) > > + swapfile_read_complete(&ki->iocb, ret, 0); > > We only touch the 'ki' here ... if the caller didn't call read_complete > > > + swapfile_put_kiocb(ki); > > Except for here, which is only touched in order to put the refcount. > > So why can't swapfile_read_complete() do the work of freeing the ki? When I was doing something similar for cachefiles, I couldn't get it to work like that. I'll have another look at that. David
David Howells <dhowells@redhat.com> wrote: > Matthew Wilcox <willy@infradead.org> wrote: > > > After submitting the IO here ... > > > > > + if (ret != -EIOCBQUEUED) > > > + swapfile_read_complete(&ki->iocb, ret, 0); > > > > We only touch the 'ki' here ... if the caller didn't call read_complete > > > > > + swapfile_put_kiocb(ki); > > > > Except for here, which is only touched in order to put the refcount. > > > > So why can't swapfile_read_complete() do the work of freeing the ki? > > When I was doing something similar for cachefiles, I couldn't get it to work > like that. I'll have another look at that. Ah, yes. generic_file_direct_write() accesses in the kiocb *after* calling ->direct_IO(), so the kiocb *must not* go away until after generic_file_direct_write() has returned. David
On Thu, Aug 12, 2021 at 02:37:59PM +0100, David Howells wrote: > David Howells <dhowells@redhat.com> wrote: > > > Matthew Wilcox <willy@infradead.org> wrote: > > > > > After submitting the IO here ... > > > > > > > + if (ret != -EIOCBQUEUED) > > > > + swapfile_read_complete(&ki->iocb, ret, 0); > > > > > > We only touch the 'ki' here ... if the caller didn't call read_complete > > > > > > > + swapfile_put_kiocb(ki); > > > > > > Except for here, which is only touched in order to put the refcount. > > > > > > So why can't swapfile_read_complete() do the work of freeing the ki? > > > > When I was doing something similar for cachefiles, I couldn't get it to work > > like that. I'll have another look at that. > > Ah, yes. generic_file_direct_write() accesses in the kiocb *after* calling > ->direct_IO(), so the kiocb *must not* go away until after > generic_file_direct_write() has returned. This is a read, not a write ... but we don't care about ki_pos being updated, so that store can be conditioned on IOCB_SWAP being clear. Or instead of storing directly to ki_pos, we take a pointer to ki_pos and then redirect that pointer somewhere harmless.
Matthew Wilcox <willy@infradead.org> wrote: > This is a read, not a write ... but we don't care about ki_pos being > updated, so that store can be conditioned on IOCB_SWAP being clear. > Or instead of storing directly to ki_pos, we take a pointer to ki_pos > and then redirect that pointer somewhere harmless. But see also ext4_dio_read_iter(), for example. That touches ki_filp after starting the op. David
On Thu, Aug 12, 2021 at 01:57:05PM +0100, David Howells wrote: > Christoph Hellwig <hch@lst.de> wrote: > > > On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote: > > > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use > > > the ->direct_IO() method on the filesystem rather then ->readpage(). > > > > ->direct_IO is just a helper for ->read_iter and ->write_iter, so please > > don't call it directly. It actually is slowly on its way out, with at > > at least all of the iomap implementations not using it, as well as various > > other file systems. > > [Note that __swap_writepage() uses ->direct_IO().] > > Calling ->write_iter is probably a bad idea here. Imagine that it goes > through, say, generic_file_write_iter(), then __generic_file_write_iter() and > then generic_file_direct_write(). It adds a number of delays into the system, > including: > > - Taking the inode lock > - Removing file privs > - Cranking mtime, ctime, file version > - Doing mnt_want_write > - Setting the inode dirty > - Waiting on pages in the range that are being written > - Walking over the pagecache to invalidate the range > - Redoing the invalidation (can't be skipped since page 0 is pinned) > > that we might want to skip as they'll end up being done for every page swapped > out. I agree with David; we want something lower-level for swap to call into. I'd suggest aops->swap_rw and an implementation might well look something like: static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter) { return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0); }
On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote: > I agree with David; we want something lower-level for swap to call into. > I'd suggest aops->swap_rw and an implementation might well look > something like: > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter) > { > return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0); > } Yes, that might make sense and would also replace the awkward IOCB_SWAP flag for the write side. For file systems like ext4 and xfs that have an in-memory block mapping tree this would be way better than the current version and also support swap on say multi-device file systems properly. We'd just need to be careful to read the extent information in at extent_activate time, by doing xfs_iread_extents for XFS or the equivalents in other file systems.
On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote: > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote: > > I agree with David; we want something lower-level for swap to call into. > > I'd suggest aops->swap_rw and an implementation might well look > > something like: > > > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter) > > { > > return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0); > > } > > Yes, that might make sense and would also replace the awkward IOCB_SWAP > flag for the write side. > > For file systems like ext4 and xfs that have an in-memory block mapping > tree this would be way better than the current version and also support > swap on say multi-device file systems properly. We'd just need to be > careful to read the extent information in at extent_activate time, > by doing xfs_iread_extents for XFS or the equivalents in other file > systems. You'd still want to walk the extent map at activation time to reject swapfiles with holes, shared extents, etc., right? --D
On Thu, Aug 12, 2021 at 10:48:18AM -0700, Darrick J. Wong wrote: > On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote: > > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote: > > > I agree with David; we want something lower-level for swap to call into. > > > I'd suggest aops->swap_rw and an implementation might well look > > > something like: > > > > > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter) > > > { > > > return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0); > > > } > > > > Yes, that might make sense and would also replace the awkward IOCB_SWAP > > flag for the write side. > > > > For file systems like ext4 and xfs that have an in-memory block mapping > > tree this would be way better than the current version and also support > > swap on say multi-device file systems properly. We'd just need to be > > careful to read the extent information in at extent_activate time, > > by doing xfs_iread_extents for XFS or the equivalents in other file > > systems. > > You'd still want to walk the extent map at activation time to reject > swapfiles with holes, shared extents, etc., right? Well ... this would actually allow the filesystem to break COWs and allocate new blocks for holes. Maybe you don't want to be doing that in a low-memory situation though ;-)
On Thu, Aug 12, 2021 at 07:14:37PM +0100, Matthew Wilcox wrote: > > Well ... this would actually allow the filesystem to break COWs and > allocate new blocks for holes. Maybe you don't want to be doing that > in a low-memory situation though ;-) I'm not sure the benefits are worth the costs. You'd have to handle ENOSPC errors, and it would require some kind of metadata journal transaction, which could potentially block for any number of reasons (not just due to memory allocations, but because you're waiting for a journal commit to complete). As you say, doing that in a low-memory situation seems to be unneeded complexity. - Ted
On Thu, Aug 12, 2021 at 10:48:18AM -0700, Darrick J. Wong wrote: > On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote: > > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote: > > > I agree with David; we want something lower-level for swap to call into. > > > I'd suggest aops->swap_rw and an implementation might well look > > > something like: > > > > > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter) > > > { > > > return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0); > > > } > > > > Yes, that might make sense and would also replace the awkward IOCB_SWAP > > flag for the write side. > > > > For file systems like ext4 and xfs that have an in-memory block mapping > > tree this would be way better than the current version and also support > > swap on say multi-device file systems properly. We'd just need to be > > careful to read the extent information in at extent_activate time, > > by doing xfs_iread_extents for XFS or the equivalents in other file > > systems. > > You'd still want to walk the extent map at activation time to reject > swapfiles with holes, shared extents, etc., right? Yes. While direct I/O code could do allocation at swap I/O time that probably is not a good idea due to the memory requirements.
diff --git a/mm/page_io.c b/mm/page_io.c index edb72bf624d2..108f864cea28 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -354,6 +354,72 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, return 0; } +struct swapfile_kiocb { + struct kiocb iocb; + struct page *page; + refcount_t ki_refcnt; +}; + +static void swapfile_put_kiocb(struct swapfile_kiocb *ki) +{ + if (refcount_dec_and_test(&ki->ki_refcnt)) { + fput(ki->iocb.ki_filp); + kfree(ki); + } +} + +static void swapfile_read_complete(struct kiocb *iocb, long ret, long ret2) +{ + struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb); + struct page *page = ki->page; + + if (ret == PAGE_SIZE) { + count_vm_event(PSWPIN); + SetPageUptodate(page); + } else { + pr_err_ratelimited("Read error (%ld) on dio swapfile (%llu)\n", + ret, page_file_offset(page)); + } + + unlock_page(page); + swapfile_put_kiocb(ki); +} + +static int swapfile_read(struct swap_info_struct *sis, struct page *page, + bool synchronous) +{ + struct swapfile_kiocb *ki; + struct file *swap_file = sis->swap_file; + struct bio_vec bv = { + .bv_page = page, + .bv_len = PAGE_SIZE, + .bv_offset = 0 + }; + struct iov_iter to; + int ret; + + ki = kzalloc(sizeof(*ki), GFP_KERNEL); + if (!ki) + return -ENOMEM; + + refcount_set(&ki->ki_refcnt, 2); + init_sync_kiocb(&ki->iocb, swap_file); + ki->page = page; + ki->iocb.ki_flags = IOCB_DIRECT | IOCB_SWAP; + ki->iocb.ki_pos = page_file_offset(page); + ki->iocb.ki_filp = get_file(swap_file); + if (!synchronous) + ki->iocb.ki_complete = swapfile_read_complete; + + iov_iter_bvec(&to, READ, &bv, 1, PAGE_SIZE); + ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to); + + if (ret != -EIOCBQUEUED) + swapfile_read_complete(&ki->iocb, ret, 0); + swapfile_put_kiocb(ki); + return (ret > 0) ? 0 : ret; +} + int swap_readpage(struct page *page, bool synchronous) { struct bio *bio; @@ -381,12 +447,7 @@ int swap_readpage(struct page *page, bool synchronous) } if (data_race(sis->flags & SWP_FS_OPS)) { - struct file *swap_file = sis->swap_file; - struct address_space *mapping = swap_file->f_mapping; - - ret = mapping->a_ops->readpage(swap_file, page); - if (!ret) - count_vm_event(PSWPIN); + ret = swapfile_read(sis, page, synchronous); goto out; }
Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use the ->direct_IO() method on the filesystem rather then ->readpage(). Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> --- mm/page_io.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 67 insertions(+), 6 deletions(-)