Message ID | 20180514153624.29598-7-agruenba@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, May 14, 2018 at 05:36:19PM +0200, Andreas Gruenbacher wrote: > Add write_begin and write_end operations to struct iomap_ops to provide > a way of overriding the default behavior of iomap_write_begin and > iomap_write_end. This is needed for implementing data journaling: in > the data journaling case, pages are written into the journal before > being written back to their proper on-disk locations. I'm not sure adding a per-page filesystem callout abstraction is what we want in the iomap code. The commit message does not explain why gfs2 needs per-page callouts, nor why the per-page work cannot be done/avoided by running code in the ->iomap_begin callout (e.g. by unstuffing the inode so nothing needs to be done per-page). My main concern is that the callouts to the filesystem in iomap are - by design - per IO, not per page. The whole point of using iomap was to get away from doing per-page filesystem operations on multi-page IOs - it's highly inefficient when we only need to call into the filesystem on a per-extent basis, and we simplified the code a *lot* by avoiding such behaviours. And to that point: this change has serious conflicts with the next step for the iomap infrastructure: Christoph's recent "remove bufferheads from iomap" series. Apart from the obvious code conflicts, there's a fairly major architectural/functional conflict. https://marc.info/?l=linux-fsdevel&m=152585212000411&w=2 That is, Christoph's patchset pushes us further away from filesystems doing their own per-page processing of page state. It converts the iomap IO path to store it's own per-page state tracking information in page->private, greatly reducing memory overhead (16 bytes on 4k page instead of 104 bytes per bufferhead) and streamlining the code. This, however, essentially makes the buffered IO path through the iomap infrastructure incompatible with existing bufferhead based filesystems. Depsite this, we can still provide limited support for buffered writes on bufferhead based filesystems through the iomap infrastructure, but I only see that as a mechanism for filesystems moving away from dependencies on bufferheads. It would require a different refactoring of the iomap code - I'd much prefer to keep bufferhead dependent IO paths completely separate (e.g. via a new actor function) to minimise impact and dependencies on the internal bufferhead free iomap IO path.... Let's see what Christoph thinks first, though.... Cheers, Dave.
On Tue, May 15, 2018 at 11:11:47AM +1000, Dave Chinner wrote: > I'm not sure adding a per-page filesystem callout abstraction is > what we want in the iomap code. The commit message does not explain > why gfs2 needs per-page callouts, nor why the per-page work cannot > be done/avoided by running code in the ->iomap_begin callout > (e.g. by unstuffing the inode so nothing needs to be done per-page). Agreed. > > My main concern is that the callouts to the filesystem in iomap are > - by design - per IO, not per page. The whole point of using iomap > was to get away from doing per-page filesystem operations on > multi-page IOs - it's highly inefficient when we only need to call > into the filesystem on a per-extent basis, and we simplified the > code a *lot* by avoiding such behaviours. Yes. And from a quick look we can easily handle "stuffed" aka inline data with the existing architecture. We just need a new IOMAP_INLINE flag for the extent type. Details will need to be worked out how to pass that data, either a struct page or virtual address should probably do it. The other thing gfs2 seems to be doing is handling of journaled data. I'm not quite sure how to fit that into the grand overall scheme of things, but at least that would only be after the write. > That is, Christoph's patchset pushes us further away from > filesystems doing their own per-page processing of page state. It > converts the iomap IO path to store it's own per-page state tracking > information in page->private, greatly reducing memory overhead (16 > bytes on 4k page instead of 104 bytes per bufferhead) and Or in fact zero bytes for block size == PAGE_SIZE > streamlining the code. This, however, essentially makes the > buffered IO path through the iomap infrastructure incompatible with > existing bufferhead based filesystems. > > Depsite this, we can still provide limited support for buffered > writes on bufferhead based filesystems through the iomap > infrastructure, but I only see that as a mechanism for filesystems > moving away from dependencies on bufferheads. It would require a > different refactoring of the iomap code - I'd much prefer to keep > bufferhead dependent IO paths completely separate (e.g. via a new > actor function) to minimise impact and dependencies on the internal > bufferhead free iomap IO path.... > > Let's see what Christoph thinks first, though.... gfs2 seems to indeed depend pretty heavily on buffer_heads. This is a bit of a bummer, but I'd need to take a deeper look how to offer iomap functionality to them without them moving away from buffer_heads which isn't really going to be a quick project. In a way the write_begin/write_end ops from Andreas facilitate that, as the rest of iomap_file_buffered_write and iomap_write_actor are untouched. So at least temporarily his callouts are what we need, even if the actual inline data functionality should be moved out of them for sure, but only with a long term plan to move gfs2 away from buffer heads. Btw, does gfs2 support blocksizes smallers than PAGE_SIZE? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ---end quoted text---
On 15 May 2018 at 09:22, Christoph Hellwig <hch@lst.de> wrote: > On Tue, May 15, 2018 at 11:11:47AM +1000, Dave Chinner wrote: >> I'm not sure adding a per-page filesystem callout abstraction is >> what we want in the iomap code. The commit message does not explain >> why gfs2 needs per-page callouts, nor why the per-page work cannot >> be done/avoided by running code in the ->iomap_begin callout >> (e.g. by unstuffing the inode so nothing needs to be done per-page). > > Agreed. > >> My main concern is that the callouts to the filesystem in iomap are >> - by design - per IO, not per page. The whole point of using iomap >> was to get away from doing per-page filesystem operations on >> multi-page IOs - it's highly inefficient when we only need to call >> into the filesystem on a per-extent basis, and we simplified the >> code a *lot* by avoiding such behaviours. > > Yes. And from a quick look we can easily handle "stuffed" aka inline > data with the existing architecture. We just need a new IOMAP_INLINE > flag for the extent type. Details will need to be worked out how to > pass that data, either a struct page or virtual address should probably > do it. This is about data journaling, it has nothing to do with "stuffed" files per se. With journaled data writes, we need to do some special processing for each page so that the page ends up in the journal before it gets written back to its proper on-disk location. It just happens that since we need these per-page operations anyway, the "unstuffing" and "re-stuffing" nicely fits in those operations as well. > The other thing gfs2 seems to be doing is handling of journaled > data. I'm not quite sure how to fit that into the grand overall scheme > of things, but at least that would only be after the write. Not sure what you mean by "but at least that would only be after the write". >> That is, Christoph's patchset pushes us further away from >> filesystems doing their own per-page processing of page state. It >> converts the iomap IO path to store it's own per-page state tracking >> information in page->private, greatly reducing memory overhead (16 >> bytes on 4k page instead of 104 bytes per bufferhead) and > > Or in fact zero bytes for block size == PAGE_SIZE > >> streamlining the code. This, however, essentially makes the >> buffered IO path through the iomap infrastructure incompatible with >> existing bufferhead based filesystems. >> >> Depsite this, we can still provide limited support for buffered >> writes on bufferhead based filesystems through the iomap >> infrastructure, but I only see that as a mechanism for filesystems >> moving away from dependencies on bufferheads. It would require a >> different refactoring of the iomap code - I'd much prefer to keep >> bufferhead dependent IO paths completely separate (e.g. via a new >> actor function) to minimise impact and dependencies on the internal >> bufferhead free iomap IO path.... >> >> Let's see what Christoph thinks first, though.... > > gfs2 seems to indeed depend pretty heavily on buffer_heads. This is > a bit of a bummer, but I'd need to take a deeper look how to offer > iomap functionality to them without them moving away from buffer_heads > which isn't really going to be a quick project. In a way the > write_begin/write_end ops from Andreas facilitate that, as the > rest of iomap_file_buffered_write and iomap_write_actor are untouched. > > So at least temporarily his callouts are what we need, even if the > actual inline data functionality should be moved out of them for sure, Been there. An earlier version of the patches did have a separate stuffed write path and the unstuffing and re-stuffing didn't happen in those per-page operations. The result was much messier overall. > but only with a long term plan to move gfs2 away from buffer heads. I hope we'll get rid of buffer heads in additional parts of gfs2, but that's not going to happen soon. The main point of those per-page operations is still to support data journaling though. > Btw, does gfs2 support blocksizes smallers than PAGE_SIZE? Yes. Thanks, Andreas
On Tue, May 15, 2018 at 10:16:52AM +0200, Andreas Gruenbacher wrote: > > Yes. And from a quick look we can easily handle "stuffed" aka inline > > data with the existing architecture. We just need a new IOMAP_INLINE > > flag for the extent type. Details will need to be worked out how to > > pass that data, either a struct page or virtual address should probably > > do it. > > This is about data journaling, it has nothing to do with "stuffed" > files per se. With journaled data writes, we need to do some special > processing for each page so that the page ends up in the journal > before it gets written back to its proper on-disk location. It just > happens that since we need these per-page operations anyway, the > "unstuffing" and "re-stuffing" nicely fits in those operations as > well. Your series makes two uses of the callbacks - one use case is to deal with inline data. The right way to deal with that is to make inline data an explicit iomap type (done in my next posting of the buffer head removal series), and then have iomap_begin find the data, kmap it and return it in the iomap, with iomap_end doing any fixups. For the data journaing I agree that we need a post-write per page callout, but I think this should be just a simple optional callout just for data journaling, which doesn't override the remaining write_end handling. > Been there. An earlier version of the patches did have a separate > stuffed write path and the unstuffing and re-stuffing didn't happen in > those per-page operations. The result was much messier overall. Pointer, please.
2018-05-18 18:04 GMT+02:00 Christoph Hellwig <hch@lst.de>: > On Tue, May 15, 2018 at 10:16:52AM +0200, Andreas Gruenbacher wrote: >> > Yes. And from a quick look we can easily handle "stuffed" aka inline >> > data with the existing architecture. We just need a new IOMAP_INLINE >> > flag for the extent type. Details will need to be worked out how to >> > pass that data, either a struct page or virtual address should probably >> > do it. >> >> This is about data journaling, it has nothing to do with "stuffed" >> files per se. With journaled data writes, we need to do some special >> processing for each page so that the page ends up in the journal >> before it gets written back to its proper on-disk location. It just >> happens that since we need these per-page operations anyway, the >> "unstuffing" and "re-stuffing" nicely fits in those operations as >> well. > > Your series makes two uses of the callbacks - one use case is to > deal with inline data. Yes, it's not perfect. > The right way to deal with that is to make > inline data an explicit iomap type (done in my next posting of the > buffer head removal series), and then have iomap_begin find the data, > kmap it and return it in the iomap, with iomap_end doing any fixups. I see what you mean. How would iomap_write_actor deal with IOMAP_INLINE -- just skip iomap_write_begin and iomap_write_end and grab the page from the iomap? I'm a bit worried about reintroducing the deadlock that iomap_write_actor so carefully tries to avoid. > For the data journaing I agree that we need a post-write per page > callout, but I think this should be just a simple optional callout > just for data journaling, which doesn't override the remaining > write_end handling. > >> Been there. An earlier version of the patches did have a separate >> stuffed write path and the unstuffing and re-stuffing didn't happen in >> those per-page operations. The result was much messier overall. > > Pointer, please. It's in this patch: https://patchwork.kernel.org/patch/10191007/ from this posting: https://www.spinics.net/lists/linux-fsdevel/msg121318.html Thanks, Andreas
On Fri, May 25, 2018 at 07:58:14PM +0200, Andreas Grünbacher wrote: > > The right way to deal with that is to make > > inline data an explicit iomap type (done in my next posting of the > > buffer head removal series), and then have iomap_begin find the data, > > kmap it and return it in the iomap, with iomap_end doing any fixups. > > I see what you mean. How would iomap_write_actor deal with > IOMAP_INLINE -- just skip iomap_write_begin and iomap_write_end and > grab the page from the iomap? I would add a data pointer to struct iomap and pass the data back from there. We'll still need most of iomap_write_begin / iomap_write_end, but just copy the data from the iomap instead of reading it from disk. Note that the block_write_begin code already handles that case for the get_blocks path.
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 1e01fabef130..9598ea936405 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -848,6 +848,8 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length, const struct iomap_ops ext2_iomap_ops = { .iomap_begin = ext2_iomap_begin, + .write_begin = iomap_write_begin, + .write_end = iomap_write_end, .iomap_end = ext2_iomap_end, }; #else diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 1e50c5efae67..05c8e8015f13 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3604,6 +3604,8 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length, const struct iomap_ops ext4_iomap_ops = { .iomap_begin = ext4_iomap_begin, + .write_begin = iomap_write_begin, + .write_end = iomap_write_end, .iomap_end = ext4_iomap_end, }; diff --git a/fs/iomap.c b/fs/iomap.c index afd163586aa0..27d97a290623 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -108,7 +108,7 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len) truncate_pagecache_range(inode, max(pos, i_size), pos + len); } -static int +int iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags, struct page **pagep, struct iomap *iomap) { @@ -137,8 +137,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags, *pagep = page; return status; } +EXPORT_SYMBOL_GPL(iomap_write_begin); -static int +int iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied, struct page *page) { @@ -150,12 +151,19 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, iomap_write_failed(inode, pos, len); return ret; } +EXPORT_SYMBOL_GPL(iomap_write_end); + +struct iomap_write_args { + struct iov_iter *iter; + const struct iomap_ops *ops; +}; static loff_t iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data, struct iomap *iomap) { - struct iov_iter *i = data; + struct iomap_write_args *args = data; + struct iov_iter *i = args->iter; long status = 0; ssize_t written = 0; unsigned int flags = AOP_FLAG_NOFS; @@ -188,7 +196,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data, break; } - status = iomap_write_begin(inode, pos, bytes, flags, &page, + status = args->ops->write_begin(inode, pos, bytes, flags, &page, iomap); if (unlikely(status)) break; @@ -200,7 +208,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data, flush_dcache_page(page); - status = iomap_write_end(inode, pos, bytes, copied, page); + status = args->ops->write_end(inode, pos, bytes, copied, page); if (unlikely(status < 0)) break; copied = status; @@ -237,10 +245,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter, { struct inode *inode = iocb->ki_filp->f_mapping->host; loff_t pos = iocb->ki_pos, ret = 0, written = 0; + struct iomap_write_args args = { + .iter = iter, + .ops = ops, + }; while (iov_iter_count(iter)) { ret = iomap_apply(inode, pos, iov_iter_count(iter), - IOMAP_WRITE, ops, iter, iomap_write_actor); + IOMAP_WRITE, ops, &args, iomap_write_actor); if (ret <= 0) break; pos += ret; @@ -271,6 +283,7 @@ static loff_t iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data, struct iomap *iomap) { + const struct iomap_ops *ops = data; long status = 0; ssize_t written = 0; @@ -286,15 +299,15 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data, if (IS_ERR(rpage)) return PTR_ERR(rpage); - status = iomap_write_begin(inode, pos, bytes, - AOP_FLAG_NOFS, &page, iomap); + status = ops->write_begin(inode, pos, bytes, + AOP_FLAG_NOFS, &page, iomap); put_page(rpage); if (unlikely(status)) return status; WARN_ON_ONCE(!PageUptodate(page)); - status = iomap_write_end(inode, pos, bytes, bytes, page); + status = ops->write_end(inode, pos, bytes, bytes, page); if (unlikely(status <= 0)) { if (WARN_ON_ONCE(status == 0)) return -EIO; @@ -320,7 +333,7 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len, loff_t ret; while (len) { - ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL, + ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, (void *)ops, iomap_dirty_actor); if (ret <= 0) return ret; @@ -333,20 +346,21 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len, EXPORT_SYMBOL_GPL(iomap_file_dirty); static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset, - unsigned bytes, struct iomap *iomap) + unsigned bytes, const struct iomap_ops *ops, + struct iomap *iomap) { struct page *page; int status; - status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page, - iomap); + status = ops->write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page, + iomap); if (status) return status; zero_user(page, offset, bytes); mark_page_accessed(page); - return iomap_write_end(inode, pos, bytes, bytes, page); + return ops->write_end(inode, pos, bytes, bytes, page); } static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes, @@ -359,11 +373,16 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes, offset, bytes); } +struct iomap_zero_range_args { + const struct iomap_ops *ops; + bool *did_zero; +}; + static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count, void *data, struct iomap *iomap) { - bool *did_zero = data; + struct iomap_zero_range_args *args = data; loff_t written = 0; int status; @@ -380,15 +399,16 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count, if (IS_DAX(inode)) status = iomap_dax_zero(pos, offset, bytes, iomap); else - status = iomap_zero(inode, pos, offset, bytes, iomap); + status = iomap_zero(inode, pos, offset, bytes, + (void *)args->ops, iomap); if (status < 0) return status; pos += bytes; count -= bytes; written += bytes; - if (did_zero) - *did_zero = true; + if (args->did_zero) + *args->did_zero = true; } while (count > 0); return written; @@ -398,11 +418,15 @@ int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, const struct iomap_ops *ops) { + struct iomap_zero_range_args args = { + .ops = ops, + .did_zero = did_zero, + }; loff_t ret; while (len > 0) { ret = iomap_apply(inode, pos, len, IOMAP_ZERO, - ops, did_zero, iomap_zero_range_actor); + ops, &args, iomap_zero_range_actor); if (ret <= 0) return ret; diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 046469fcc1b8..2eec725cf989 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1199,6 +1199,8 @@ xfs_file_iomap_end( const struct iomap_ops xfs_iomap_ops = { .iomap_begin = xfs_file_iomap_begin, + .write_begin = iomap_write_begin, + .write_end = iomap_write_end, .iomap_end = xfs_file_iomap_end, }; diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 19a07de28212..423f7ecc1231 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -61,6 +61,8 @@ struct iomap { #define IOMAP_DIRECT (1 << 4) /* direct I/O */ #define IOMAP_NOWAIT (1 << 5) /* Don't wait for writeback */ +struct page; + struct iomap_ops { /* * Return the existing mapping at pos, or reserve space starting at @@ -70,6 +72,21 @@ struct iomap_ops { int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length, unsigned flags, struct iomap *iomap); + /* + * Begin writing to a page, similar to block_write_begin but in a + * mapping returned by iomap_begin. Usually initialized to + * iomap_write_begin. + */ + int (*write_begin)(struct inode *inode, loff_t pos, unsigned len, + unsigned flags, struct page **pagep, + struct iomap *iomap); + + /* + * End writing to a page. Usually initialized to iomap_write_end. + */ + int (*write_end)(struct inode *inode, loff_t pos, unsigned len, + unsigned copied, struct page *page); + /* * Commit and/or unreserve space previous allocated using iomap_begin. * Written indicates the length of the successful write operation which @@ -80,6 +97,11 @@ struct iomap_ops { ssize_t written, unsigned flags, struct iomap *iomap); }; +int iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, + unsigned flags, struct page **pagep, struct iomap *iomap); +int iomap_write_end(struct inode *inode, loff_t pos, unsigned len, + unsigned copied, struct page *page); + ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, const struct iomap_ops *ops); int iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
Add write_begin and write_end operations to struct iomap_ops to provide a way of overriding the default behavior of iomap_write_begin and iomap_write_end. This is needed for implementing data journaling: in the data journaling case, pages are written into the journal before being written back to their proper on-disk locations. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Dave Chinner <dchinner@redhat.com> --- fs/ext2/inode.c | 2 ++ fs/ext4/inode.c | 2 ++ fs/iomap.c | 62 ++++++++++++++++++++++++++++++------------- fs/xfs/xfs_iomap.c | 2 ++ include/linux/iomap.h | 22 +++++++++++++++ 5 files changed, 71 insertions(+), 19 deletions(-)