Message ID | 20210723174131.180813-1-hsiangkao@linux.alibaba.com (mailing list archive)
---|---
State | New, archived
Series | [v7] iomap: make inline data support more flexible
On Sat, Jul 24, 2021 at 01:41:31AM +0800, Gao Xiang wrote:
> Add support for reading inline data content into the page cache from
> nonzero page-aligned file offsets. This enables the EROFS tailpacking
> mode where the last few bytes of the file are stored right after the
> inode.
>
> The buffered write path remains untouched since EROFS cannot be used
> for testing. It'd be better to be implemented if upcoming real users
> care and provide a real pattern rather than leave untested dead code
> around.

My one complaint with this version is the subject line.  It's a bit
vague.  I went with:

    iomap: Support file tail packing

I also wrote a changelog entry that reads:

    The existing inline data support only works for cases where the entire
    file is stored as inline data.  For larger files, EROFS stores the
    initial blocks separately and then can pack a small tail adjacent to
    the inode.  Generalise inline data to allow for tail packing.  Tails
    may not cross a page boundary in memory.

... but I'm not sure that's necessarily better than what you've written
here.

> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
> Tested-by: Huang Jianan <huangjianan@oppo.com> # erofs
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Hi Matthew,

On Fri, Jul 23, 2021 at 08:40:51PM +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 01:41:31AM +0800, Gao Xiang wrote:
> > Add support for reading inline data content into the page cache from
> > nonzero page-aligned file offsets. This enables the EROFS tailpacking
> > mode where the last few bytes of the file are stored right after the
> > inode.
> >
> > The buffered write path remains untouched since EROFS cannot be used
> > for testing. It'd be better to be implemented if upcoming real users
> > care and provide a real pattern rather than leave untested dead code
> > around.
>
> My one complaint with this version is the subject line.  It's a bit vague.
> I went with:
>
>     iomap: Support file tail packing
>
> I also wrote a changelog entry that reads:
>
>     The existing inline data support only works for cases where the entire
>     file is stored as inline data.  For larger files, EROFS stores the
>     initial blocks separately and then can pack a small tail adjacent to
>     the inode.  Generalise inline data to allow for tail packing.  Tails
>     may not cross a page boundary in memory.

Yeah, we could complete the commit message like this.

Actually, the EROFS inode base is only 32 or 64 bytes, so the maximum
tail that can be packed after it is not exactly small, compared with
using another tail block or a block shared with other (possibly
irrelevant) inodes. Following the cache locality principle, mkfs can
select a strategy that inlines the tail block together with the inode
base, so that loading the inode also loads the tail into the page
cache, reducing both I/O and fragmentation.

> ... but I'm not sure that's necessarily better than what you've written
> here.
>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Darrick J. Wong <djwong@kernel.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
> > Tested-by: Huang Jianan <huangjianan@oppo.com> # erofs
> > Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Many thanks for the review!

Thanks,
Gao Xiang
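A rough sketch of the layout idea described above (hypothetical structure
and names, not the actual EROFS on-disk format):

	#define EROFS_BLKSIZ	4096	/* assumed block size */

	/*
	 * Hypothetical illustration only: with a 64-byte inode base, up to
	 * EROFS_BLKSIZ - 64 bytes of tail data fit in the same block, so
	 * reading the inode's block also brings the file tail into the
	 * page cache.
	 */
	struct tailpacked_block {
		char inode_base[64];			/* compact inode metadata */
		char tail_data[EROFS_BLKSIZ - 64];	/* last partial block of the file */
	};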
On Fri, Jul 23, 2021 at 7:41 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> Add support for reading inline data content into the page cache from
> nonzero page-aligned file offsets. This enables the EROFS tailpacking
> mode where the last few bytes of the file are stored right after the
> inode.
>
> The buffered write path remains untouched since EROFS cannot be used
> for testing. It'd be better to be implemented if upcoming real users
> care and provide a real pattern rather than leave untested dead code
> around.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
> Tested-by: Huang Jianan <huangjianan@oppo.com> # erofs
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> ---
> v6: https://lore.kernel.org/r/20210722031729.51628-1-hsiangkao@linux.alibaba.com
> changes since v6:
>  - based on Christoph's reply;
>  - update commit message suggested by Darrick;
>  - disable buffered write path until some real fs users.
>
>  fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++----------------
>  fs/iomap/direct-io.c   | 10 ++++++----
>  include/linux/iomap.h  | 14 ++++++++++++++
>  3 files changed, 46 insertions(+), 20 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 87ccb3438bec..f351e1f9e3f6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -205,25 +205,29 @@ struct iomap_readpage_ctx {
>  	struct readahead_control *rac;
>  };
>
> -static void
> -iomap_read_inline_data(struct inode *inode, struct page *page,
> -		struct iomap *iomap)
> +static int iomap_read_inline_data(struct inode *inode, struct page *page,
> +		struct iomap *iomap, loff_t pos)

This is completely broken. This function is about filling the page
cache. All the information needed for that is in struct iomap; the
start position of an iomap operation has nothing to do with it.

>  {
> -	size_t size = i_size_read(inode);
> +	size_t size = iomap->length + iomap->offset - pos;
>  	void *addr;
>
>  	if (PageUptodate(page))
> -		return;
> +		return PAGE_SIZE;
>
> -	BUG_ON(page_has_private(page));
> -	BUG_ON(page->index);
> -	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	/* inline data must start page aligned in the file */
> +	if (WARN_ON_ONCE(offset_in_page(pos)))
> +		return -EIO;
> +	if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
> +		return -EIO;
> +	if (WARN_ON_ONCE(page_has_private(page)))
> +		return -EIO;
>
>  	addr = kmap_atomic(page);
> -	memcpy(addr, iomap->inline_data, size);
> +	memcpy(addr, iomap_inline_buf(iomap, pos), size);
>  	memset(addr + size, 0, PAGE_SIZE - size);
>  	kunmap_atomic(addr);
>  	SetPageUptodate(page);
> +	return PAGE_SIZE;
>  }
>
>  static inline bool iomap_block_needs_zeroing(struct inode *inode,
> @@ -246,11 +250,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	unsigned poff, plen;
>  	sector_t sector;
>
> -	if (iomap->type == IOMAP_INLINE) {
> -		WARN_ON_ONCE(pos);
> -		iomap_read_inline_data(inode, page, iomap);
> -		return PAGE_SIZE;
> -	}
> +	if (iomap->type == IOMAP_INLINE)
> +		return iomap_read_inline_data(inode, page, iomap, pos);
>
>  	/* zero post-eof blocks as the page may be mapped */
>  	iop = iomap_page_create(inode, page);
> @@ -589,6 +590,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
>  	return 0;
>  }
>
> +static int iomap_write_begin_inline(struct inode *inode,
> +		struct page *page, struct iomap *srcmap)
> +{
> +	/* needs more work for the tailpacking case, disable for now */
> +	if (WARN_ON_ONCE(srcmap->offset != 0))
> +		return -EIO;
> +	return iomap_read_inline_data(inode, page, srcmap, 0);
> +}
> +
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
> @@ -618,14 +628,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	}
>
>  	if (srcmap->type == IOMAP_INLINE)
> -		iomap_read_inline_data(inode, page, srcmap);
> +		status = iomap_write_begin_inline(inode, page, srcmap);
>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
>  	else
>  		status = __iomap_write_begin(inode, pos, len, flags, page,
>  				srcmap);
>
> -	if (unlikely(status))
> +	if (unlikely(status < 0))
>  		goto out_unlock;
>
>  	*pagep = page;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..a6aaea2764a5 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -378,23 +378,25 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
>  {
>  	struct iov_iter *iter = dio->submit.iter;
> +	void *dst = iomap_inline_buf(iomap, pos);
>  	size_t copied;
>
> -	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
> +		return -EIO;
>
>  	if (dio->flags & IOMAP_DIO_WRITE) {
>  		loff_t size = inode->i_size;
>
>  		if (pos > size)
> -			memset(iomap->inline_data + size, 0, pos - size);
> -		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
> +			memset(iomap_inline_buf(iomap, size), 0, pos - size);
> +		copied = copy_from_iter(dst, length, iter);
>  		if (copied) {
>  			if (pos + copied > size)
>  				i_size_write(inode, pos + copied);
>  			mark_inode_dirty(inode);
>  		}
>  	} else {
> -		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
> +		copied = copy_to_iter(dst, length, iter);
>  	}
>  	dio->size += copied;
>  	return copied;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 479c1da3e221..56b118c6d05c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -97,6 +97,20 @@ iomap_sector(struct iomap *iomap, loff_t pos)
>  	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>  }
>
> +static inline void *iomap_inline_buf(const struct iomap *iomap, loff_t pos)
> +{
> +	return iomap->inline_data - iomap->offset + pos;
> +}
> +
> +/*
> + * iomap->inline_data is a potentially kmapped page, ensure it never crosses a
> + * page boundary.
> + */
> +static inline bool iomap_inline_data_size_valid(const struct iomap *iomap)
> +{
> +	return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
> +}
> +
>  /*
>   * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
>   * and page_done will be called for each page written to. This only applies to
> --
> 2.24.4
>

Andreas
Here's a fixed and cleaned up version that passes fstests on gfs2.

I see no reason why the combination of tail packing + writing should
cause any issues, so in my opinion, the check that disables that
combination in iomap_write_begin_inline should still be removed.

It turns out that returning the number of bytes copied from
iomap_read_inline_data is a bit irritating: the function is really used
for filling the page, but that's not always the "progress" we're looking
for. In the iomap_readpage case, we actually need to advance by an
entire page, but in the iomap_file_buffered_write case, we need to
advance by the length parameter of iomap_write_actor or less. So I've
changed that back.

I've also renamed iomap_inline_buf to iomap_inline_data and I've turned
iomap_inline_data_size_valid into iomap_within_inline_data, which seems
more useful to me.

Thanks,
Andreas

--

Subject: [PATCH] iomap: Support tail packing

The existing inline data support only works for cases where the entire
file is stored as inline data. For larger files, EROFS stores the
initial blocks separately and then can pack a small tail adjacent to the
inode. Generalise inline data to allow for tail packing. Tails may not
cross a page boundary in memory.

We currently have no filesystems that support tail packing and writing,
so that case is currently disabled (see iomap_write_begin_inline). I'm
not aware of any reason why this code path shouldn't work, however.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
Tested-by: Huang Jianan <huangjianan@oppo.com> # erofs
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/iomap/buffered-io.c | 34 +++++++++++++++++++++++-----------
 fs/iomap/direct-io.c   | 11 ++++++-----
 include/linux/iomap.h  | 22 +++++++++++++++++++-
 3 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..334bf98fdd4a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -205,25 +205,29 @@ struct iomap_readpage_ctx {
 	struct readahead_control *rac;
 };
 
-static void
-iomap_read_inline_data(struct inode *inode, struct page *page,
+static int iomap_read_inline_data(struct inode *inode, struct page *page,
 		struct iomap *iomap)
 {
-	size_t size = i_size_read(inode);
+	size_t size = i_size_read(inode) - iomap->offset;
 	void *addr;
 
 	if (PageUptodate(page))
-		return;
+		return 0;
 
-	BUG_ON(page_has_private(page));
-	BUG_ON(page->index);
-	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	/* inline and tail-packed data must start page aligned in the file */
+	if (WARN_ON_ONCE(offset_in_page(iomap->offset)))
+		return -EIO;
+	if (WARN_ON_ONCE(size > PAGE_SIZE - offset_in_page(iomap->inline_data)))
+		return -EIO;
+	if (WARN_ON_ONCE(page_has_private(page)))
+		return -EIO;
 
 	addr = kmap_atomic(page);
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - size);
 	kunmap_atomic(addr);
 	SetPageUptodate(page);
+	return 0;
 }
 
 static inline bool iomap_block_needs_zeroing(struct inode *inode,
@@ -247,7 +251,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	sector_t sector;
 
 	if (iomap->type == IOMAP_INLINE) {
-		WARN_ON_ONCE(pos);
 		iomap_read_inline_data(inode, page, iomap);
 		return PAGE_SIZE;
 	}
@@ -589,6 +592,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	return 0;
 }
 
+static int iomap_write_begin_inline(struct inode *inode,
+		struct page *page, struct iomap *srcmap)
+{
+	/* needs more work for the tailpacking case, disable for now */
+	if (WARN_ON_ONCE(srcmap->offset != 0))
+		return -EIO;
+	return iomap_read_inline_data(inode, page, srcmap);
+}
+
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
@@ -618,7 +630,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	}
 
 	if (srcmap->type == IOMAP_INLINE)
-		iomap_read_inline_data(inode, page, srcmap);
+		status = iomap_write_begin_inline(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
 	else
@@ -671,11 +683,11 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
-	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(!iomap_within_inline_data(iomap, pos + copied - 1));
 
 	flush_dcache_page(page);
 	addr = kmap_atomic(page);
-	memcpy(iomap->inline_data + pos, addr + pos, copied);
+	memcpy(iomap_inline_data(iomap, pos), addr + pos, copied);
 	kunmap_atomic(addr);
 
 	mark_inode_dirty(inode);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..c9424e58f613 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -380,21 +380,22 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct iov_iter *iter = dio->submit.iter;
 	size_t copied;
 
-	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	if (WARN_ON_ONCE(!iomap_within_inline_data(iomap, pos + length - 1)))
+		return -EIO;
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
-		loff_t size = inode->i_size;
+		loff_t size = iomap->offset + iomap->length;
 
 		if (pos > size)
-			memset(iomap->inline_data + size, 0, pos - size);
-		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
+			memset(iomap_inline_data(iomap, size), 0, pos - size);
+			copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);
 		if (copied) {
 			if (pos + copied > size)
 				i_size_write(inode, pos + copied);
 			mark_inode_dirty(inode);
 		}
 	} else {
-		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
+		copied = copy_to_iter(iomap_inline_data(iomap, pos), length, iter);
 	}
 	dio->size += copied;
 	return copied;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..c1b57d34cb76 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -28,7 +28,7 @@ struct vm_fault;
 #define IOMAP_DELALLOC	1	/* delayed allocation blocks */
 #define IOMAP_MAPPED	2	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
-#define IOMAP_INLINE	4	/* data inline in the inode */
+#define IOMAP_INLINE	4	/* inline or tail-packed data */
 
 /*
  * Flags reported by the file system from iomap_begin:
@@ -97,6 +97,26 @@ iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
+/*
+ * Returns the inline data pointer for logical offset @pos.
+ */
+static void *iomap_inline_data(struct iomap *iomap, loff_t pos)
+{
+	return iomap->inline_data + pos - iomap->offset;
+}
+
+/*
+ * Check if logical offset @pos is within the valid range for inline data.
+ * This is used to guard against accessing data beyond the page inline_data
+ * points at.
+ */
+static bool iomap_within_inline_data(struct iomap *iomap, loff_t pos)
+{
+	unsigned int size = PAGE_SIZE - offset_in_page(iomap->inline_data);
+
+	return pos - iomap->offset < size;
+}
+
 /*
  * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
  * and page_done will be called for each page written to. This only applies to
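To make the "progress" distinction above concrete, a simplified sketch of
the two call sites (not verbatim kernel code):

	/* iomap_readpage_actor: an inline extent fills the whole page, so
	 * the actor reports a full page of progress regardless of length. */
	if (iomap->type == IOMAP_INLINE) {
		iomap_read_inline_data(inode, page, iomap);
		return PAGE_SIZE;
	}

	/* iomap_write_actor: progress is the number of bytes actually
	 * copied for this request, which is at most its length parameter. */
	copied = iomap_write_end(inode, pos, bytes, copied, page, iomap, srcmap);
	pos += copied;
	written += copied;	/* advance by copied, not by PAGE_SIZE */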
On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> Here's a fixed and cleaned up version that passes fstests on gfs2.
>
> I see no reason why the combination of tail packing + writing should
> cause any issues, so in my opinion, the check that disables that
> combination in iomap_write_begin_inline should still be removed.

Since there is no such fs doing tail-packing writes, I can only make a
wild guess. For example:
 1) the tail-end block was not inlined, so iomap_write_end() dirtied
    the whole page (or buffer) for the page writeback;
 2) then it was truncated into a tail-packing inline block, so the last
    extent (page) became INLINE but remained dirty;
 3) during the later page writeback for dirty pages,

        if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))

    would be triggered in iomap_writepage_map() for such a dirty page.

As Matthew pointed out before,
https://lore.kernel.org/r/YPrms0fWPwEZGNAL@casper.infradead.org/
tail-packing inline currently doesn't interact with page writeback, but
I'm afraid a filesystem that supports tail-packing writes would need to
reconsider how page and inode writeback work and what the access
pattern is with tail packing.

> It turns out that returning the number of bytes copied from
> iomap_read_inline_data is a bit irritating: the function is really used
> for filling the page, but that's not always the "progress" we're looking
> for. In the iomap_readpage case, we actually need to advance by an
> entire page, but in the iomap_file_buffered_write case, we need to
> advance by the length parameter of iomap_write_actor or less. So I've
> changed that back.
>
> I've also renamed iomap_inline_buf to iomap_inline_data and I've turned
> iomap_inline_data_size_valid into iomap_within_inline_data, which seems
> more useful to me.
>
> [...]
>
> -static void
> -iomap_read_inline_data(struct inode *inode, struct page *page,
> +static int iomap_read_inline_data(struct inode *inode, struct page *page,
>  		struct iomap *iomap)
>  {
> -	size_t size = i_size_read(inode);
> +	size_t size = i_size_read(inode) - iomap->offset;

I wonder why you use i_size / iomap->offset here, and why you
completely ignore the iomap->length field returned by the fs.

Using i_size here instead of iomap->length looks like unnecessary
coupling to me (even though there is currently such a limitation in
practice.)

Thanks,
Gao Xiang
On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> @@ -247,7 +251,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	sector_t sector;
>
>  	if (iomap->type == IOMAP_INLINE) {
> -		WARN_ON_ONCE(pos);
>  		iomap_read_inline_data(inode, page, iomap);
>  		return PAGE_SIZE;

This surely needs to return -EIO if there was an error.
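A minimal sketch of the fix being asked for here; this is also how the
updated patch later in the thread ends up handling it:

	if (iomap->type == IOMAP_INLINE) {
		int ret = iomap_read_inline_data(inode, page, iomap);

		return ret ?: PAGE_SIZE;	/* propagate -EIO, else a full page */
	}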
On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> Here's a fixed and cleaned up version that passes fstests on gfs2.

(cont.
https://lore.kernel.org/r/YP4fk75mr%2FmIotDy@B-P7TQMD6M-0146.local)

Would you mind listing what it fixes on gfs2 compared with v7? IOWs, I
wonder which case failed with v7 on gfs2 so I can recheck this.

> I see no reason why the combination of tail packing + writing should
> cause any issues, so in my opinion, the check that disables that
> combination in iomap_write_begin_inline should still be removed.
>
> [...]
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..c9424e58f613 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> [...]
> -	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	if (WARN_ON_ONCE(!iomap_within_inline_data(iomap, pos + length - 1)))
> +		return -EIO;

I also wonder what is wrong with the previous patch:

    + if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
    +	return -EIO;

    +/*
    + * iomap->inline_data is a potentially kmapped page, ensure it never crosses a
    + * page boundary.
    + */
    +static inline bool iomap_inline_data_size_valid(const struct iomap *iomap)
    +{
    +	return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
    +}

In principle, the relationship of iomap->offset, pos, length and
iomap->length is:

    iomap->offset <= pos < pos + length <= iomap->offset + iomap->length

pos and pos + length are also affected by what the user requests, not
just the original extent reported by the fs. Here we need to make sure
the whole extent fits in the page, so I think it'd be better to check
against iomap->length rather than something derived from pos and
length.

> 	if (dio->flags & IOMAP_DIO_WRITE) {
> -		loff_t size = inode->i_size;
> +		loff_t size = iomap->offset + iomap->length;

And here, since it's the last extent, with the current limitation in
practice we have

    iomap->offset + iomap->length == inode->i_size,

yet I wonder why this part calculates with iomap->length instead of
using i_size as in iomap_read_inline_data(). My thought is that this
code handles the i_size pointer and append writes, so I think
"loff_t size = inode->i_size" makes more sense here.

> 		if (pos > size)
> -			memset(iomap->inline_data + size, 0, pos - size);
> -		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
> +			memset(iomap_inline_data(iomap, size), 0, pos - size);
> +		copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);

The name iomap_inline_buf() was suggested by Darrick. From my point of
view it's the better name, since what it returns is only a part of
iomap->inline_data once pos is involved.

Thanks,
Gao Xiang
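To spell out the two checking strategies being compared above — a sketch
in which both function names are made up for this comparison:

	static inline bool whole_extent_fits(const struct iomap *iomap)
	{
		/* v7 semantics: the entire inline extent must fit in the page */
		return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
	}

	static inline bool used_range_fits(const struct iomap *iomap,
					   loff_t pos, loff_t length)
	{
		/* per-I/O semantics: only the bytes this request touches must fit */
		return pos + length - iomap->offset <=
		       PAGE_SIZE - offset_in_page(iomap->inline_data);
	}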
On Mon, Jul 26, 2021 at 5:07 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> > @@ -247,7 +251,6 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  	sector_t sector;
> >
> >  	if (iomap->type == IOMAP_INLINE) {
> > -		WARN_ON_ONCE(pos);
> >  		iomap_read_inline_data(inode, page, iomap);
> >  		return PAGE_SIZE;
>
> This surely needs to return -EIO if there was an error.

Hmm, right.

Thanks,
Andreas
On Mon, Jul 26, 2021 at 4:36 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> > Here's a fixed and cleaned up version that passes fstests on gfs2.
> >
> > I see no reason why the combination of tail packing + writing should
> > cause any issues, so in my opinion, the check that disables that
> > combination in iomap_write_begin_inline should still be removed.
>
> Since there is no such fs doing tail-packing writes, I can only make a
> wild guess. For example:
>  1) the tail-end block was not inlined, so iomap_write_end() dirtied
>     the whole page (or buffer) for the page writeback;
>  2) then it was truncated into a tail-packing inline block, so the last
>     extent (page) became INLINE but remained dirty;
>  3) during the later page writeback for dirty pages,
>         if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
>     would be triggered in iomap_writepage_map() for such a dirty page.
>
> As Matthew pointed out before,
> https://lore.kernel.org/r/YPrms0fWPwEZGNAL@casper.infradead.org/
> tail-packing inline currently doesn't interact with page writeback, but
> I'm afraid a filesystem that supports tail-packing writes would need to
> reconsider how page and inode writeback work and what the access
> pattern is with tail packing.
>
> > [...]
> >
> > -static void
> > -iomap_read_inline_data(struct inode *inode, struct page *page,
> > +static int iomap_read_inline_data(struct inode *inode, struct page *page,
> >  		struct iomap *iomap)
> >  {
> > -	size_t size = i_size_read(inode);
> > +	size_t size = i_size_read(inode) - iomap->offset;
>
> I wonder why you use i_size / iomap->offset here,

This function is supposed to copy the inline or tail data at
iomap->inline_data into the page passed to it. Logically, the inline
data starts at iomap->offset and extends until i_size_read(inode).
Relative to the page, the inline data starts at offset 0 and extends
until i_size_read(inode) - iomap->offset. It's as simple as that.

> and why you completely ignore the iomap->length field returned by the fs.

In the iomap_readpage case (iomap_begin with flags == 0),
iomap->length will be the amount of data up to the end of the inode.
In the iomap_file_buffered_write case (iomap_begin with flags ==
IOMAP_WRITE), iomap->length will be the size of iomap->inline_data.
(For extending writes, we need to write beyond the current end of
inode.) So iomap->length isn't all that useful for
iomap_read_inline_data.

> Using i_size here instead of iomap->length looks like unnecessary
> coupling to me (even though there is currently such a limitation in
> practice.)

And what is that?

Thanks,
Andreas
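A worked example of the arithmetic Andreas describes, with assumed
numbers:

	/*
	 * Assumed for illustration: PAGE_SIZE == 4096, i_size == 8200,
	 * tail packed from file offset 8192, so the inline mapping has
	 * iomap->offset == 8192.
	 */
	size_t size = 8200 - 8192;	/* i_size_read(inode) - iomap->offset == 8 */

	memcpy(addr, iomap->inline_data, size);	  /* 8 tail bytes to page offset 0 */
	memset(addr + size, 0, PAGE_SIZE - size); /* zero the remaining 4088 bytes */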
On Mon, Jul 26, 2021 at 09:22:41AM +0200, Andreas Gruenbacher wrote:
> On Mon, Jul 26, 2021 at 4:36 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> > > Here's a fixed and cleaned up version that passes fstests on gfs2.
> > >
> > > [...]
> >
> > [...]
> >
> > I wonder why you use i_size / iomap->offset here,
>
> This function is supposed to copy the inline or tail data at
> iomap->inline_data into the page passed to it. Logically, the inline
> data starts at iomap->offset and extends until i_size_read(inode).
> Relative to the page, the inline data starts at offset 0 and extends
> until i_size_read(inode) - iomap->offset. It's as simple as that.
>
> > and why you completely ignore the iomap->length field returned by the fs.
>
> In the iomap_readpage case (iomap_begin with flags == 0),
> iomap->length will be the amount of data up to the end of the inode.
> In the iomap_file_buffered_write case (iomap_begin with flags ==
> IOMAP_WRITE), iomap->length will be the size of iomap->inline_data.
> (For extending writes, we need to write beyond the current end of
> inode.) So iomap->length isn't all that useful for
> iomap_read_inline_data.
>
> > Using i_size here instead of iomap->length looks like unnecessary
> > coupling to me (even though there is currently such a limitation in
> > practice.)
>
> And what is that?

In short, I'm not against your modification; from my point of view
these are all minor points, no matter whether we go with my v7 or your
attached patch. As Darrick said before, "at least not muddy the waters
further." What I need is to confirm that the functionality works on
both gfs2 and erofs and gets merged into iomap for-next, so I can
rebase my development branch and continue my own development on top.

Finally, would you mind adding your own SOB to this and fixing what
Matthew pointed out?

Thanks,
Gao Xiang

> Thanks,
> Andreas
On Mon, Jul 26, 2021 at 6:00 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> > Here's a fixed and cleaned up version that passes fstests on gfs2.
>
> (cont.
> https://lore.kernel.org/r/YP4fk75mr%2FmIotDy@B-P7TQMD6M-0146.local)
>
> Would you mind listing what it fixes on gfs2 compared with v7? IOWs, I
> wonder which case failed with v7 on gfs2 so I can recheck this.

The use of iomap->length or pos in iomap_read_inline_data is
fundamentally broken, that's what I've fixed.

> [...]
>
> > -	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
> > +	if (WARN_ON_ONCE(!iomap_within_inline_data(iomap, pos + length - 1)))
> > +		return -EIO;
>
> I also wonder what is wrong with the previous patch:
>
>     + if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
>     +	return -EIO;
>
>     +static inline bool iomap_inline_data_size_valid(const struct iomap *iomap)
>     +{
>     +	return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
>     +}
>
> [...]
>
> Here we need to make sure the whole extent fits in the page, so I think
> it'd be better to check against iomap->length rather than something
> derived from pos and length.

Yeah okay, the difference is that I've checked the validity of the
range of the extent actually used, while iomap_inline_data_size_valid
checks that the entire mapping is valid. So let's stick with
Christoph's previous version.

> > 	if (dio->flags & IOMAP_DIO_WRITE) {
> > -		loff_t size = inode->i_size;
> > +		loff_t size = iomap->offset + iomap->length;
>
> [...] My thought is that this code handles the i_size pointer and
> append writes, so I think "loff_t size = inode->i_size" makes more
> sense here.

Hmm, right.

> The name iomap_inline_buf() was suggested by Darrick. From my point of
> view it's the better name, since what it returns is only a part of
> iomap->inline_data once pos is involved.

I find iomap_inline_buf a bit more confusing because it's not
immediately obvious that "buf" == "data".

I'll send an updated version.

Thanks,
Andreas
On 7/24/21 1:41 AM, Gao Xiang wrote:
> Add support for reading inline data content into the page cache from
> nonzero page-aligned file offsets. This enables the EROFS tailpacking
> mode where the last few bytes of the file are stored right after the
> inode.
>
> The buffered write path remains untouched since EROFS cannot be used
> for testing. It'd be better to be implemented if upcoming real users
> care and provide a real pattern rather than leave untested dead code
> around.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
> Tested-by: Huang Jianan <huangjianan@oppo.com> # erofs
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

Looks good to me.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
> [...]
On Mon, Jul 26, 2021 at 10:08:06AM +0200, Andreas Grünbacher wrote:
> On Mon, Jul 26, 2021 at 6:00 AM Gao Xiang ... wrote:
> > [...]
> >
> > The name iomap_inline_buf() was suggested by Darrick. From my point of
> > view it's the better name, since what it returns is only a part of
> > iomap->inline_data once pos is involved.
>
> I find iomap_inline_buf a bit more confusing because it's not
> immediately obvious that "buf" == "data".
>
> I'll send an updated version.

Okay, thank you very much!

Thanks,
Gao Xiang

> Thanks,
> Andreas
Here's the promised update. Passes fstests on gfs2.
Thanks,
Andreas
--
Subject: iomap: Support tail packing
The existing inline data support only works for cases where the entire
file is stored as inline data. For larger files, EROFS stores the
initial blocks separately and then can pack a small tail adjacent to the
inode. Generalise inline data to allow for tail packing. Tails may not
cross a page boundary in memory.
We currently have no filesystems that support tail packing and writing,
so that case is currently disabled (see iomap_write_begin_inline). I'm
not aware of any reason why this code path shouldn't work, however.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andreas Gruenbacher <andreas.gruenbacher@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 38 +++++++++++++++++++++++-------------
 fs/iomap/direct-io.c   |  9 +++++----
 include/linux/iomap.h  | 20 +++++++++++++++++++-
 3 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..dee6b0952ef8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -205,25 +205,29 @@ struct iomap_readpage_ctx {
 	struct readahead_control *rac;
 };
 
-static void
-iomap_read_inline_data(struct inode *inode, struct page *page,
+static int iomap_read_inline_data(struct inode *inode, struct page *page,
 		struct iomap *iomap)
 {
-	size_t size = i_size_read(inode);
+	size_t size = i_size_read(inode) - iomap->offset;
 	void *addr;
 
 	if (PageUptodate(page))
-		return;
+		return 0;
 
-	BUG_ON(page_has_private(page));
-	BUG_ON(page->index);
-	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	/* inline and tail-packed data must start page aligned in the file */
+	if (WARN_ON_ONCE(offset_in_page(iomap->offset)))
+		return -EIO;
+	if (WARN_ON_ONCE(size > PAGE_SIZE - offset_in_page(iomap->inline_data)))
+		return -EIO;
+	if (WARN_ON_ONCE(page_has_private(page)))
+		return -EIO;
 
 	addr = kmap_atomic(page);
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - size);
 	kunmap_atomic(addr);
 	SetPageUptodate(page);
+	return 0;
 }
 
 static inline bool iomap_block_needs_zeroing(struct inode *inode,
@@ -247,9 +251,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	sector_t sector;
 
 	if (iomap->type == IOMAP_INLINE) {
-		WARN_ON_ONCE(pos);
-		iomap_read_inline_data(inode, page, iomap);
-		return PAGE_SIZE;
+		int ret = iomap_read_inline_data(inode, page, iomap);
+		return ret ?: PAGE_SIZE;
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
@@ -589,6 +592,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	return 0;
 }
 
+static int iomap_write_begin_inline(struct inode *inode,
+		struct page *page, struct iomap *srcmap)
+{
+	/* needs more work for the tailpacking case, disable for now */
+	if (WARN_ON_ONCE(srcmap->offset != 0))
+		return -EIO;
+	return iomap_read_inline_data(inode, page, srcmap);
+}
+
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
@@ -618,7 +630,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	}
 
 	if (srcmap->type == IOMAP_INLINE)
-		iomap_read_inline_data(inode, page, srcmap);
+		status = iomap_write_begin_inline(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
 	else
@@ -671,11 +683,11 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
-	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(!iomap_inline_data_size_valid(iomap));
 
 	flush_dcache_page(page);
 	addr = kmap_atomic(page);
-	memcpy(iomap->inline_data + pos, addr + pos, copied);
+	memcpy(iomap_inline_data(iomap, pos), addr + pos, copied);
 	kunmap_atomic(addr);
 
 	mark_inode_dirty(inode);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..6fdae86d0f1d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -380,21 +380,22 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct iov_iter *iter = dio->submit.iter;
 	size_t copied;
 
-	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
+		return -EIO;
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
 		loff_t size = inode->i_size;
 
 		if (pos > size)
-			memset(iomap->inline_data + size, 0, pos - size);
-		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
+			memset(iomap_inline_data(iomap, size), 0, pos - size);
+		copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);
 		if (copied) {
 			if (pos + copied > size)
 				i_size_write(inode, pos + copied);
 			mark_inode_dirty(inode);
 		}
 	} else {
-		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
+		copied = copy_to_iter(iomap_inline_data(iomap, pos), length, iter);
 	}
 	dio->size += copied;
 	return copied;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..c6af1ef608c6
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -28,7 +28,7 @@ struct vm_fault;
#define IOMAP_DELALLOC 1 /* delayed allocation blocks */
#define IOMAP_MAPPED 2 /* blocks allocated at @addr */
#define IOMAP_UNWRITTEN 3 /* blocks allocated at @addr in unwritten state */
-#define IOMAP_INLINE 4 /* data inline in the inode */
+#define IOMAP_INLINE 4 /* inline or tail-packed data */
/*
* Flags reported by the file system from iomap_begin:
@@ -97,6 +97,24 @@ iomap_sector(struct iomap *iomap, loff_t pos)
return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
}
+/*
+ * Returns the inline data pointer for logical offset @pos.
+ */
+static inline void *iomap_inline_data(struct iomap *iomap, loff_t pos)
+{
+ return iomap->inline_data + pos - iomap->offset;
+}
+
+/*
+ * Check if the mapping's length is within the valid range for inline data.
+ * This is used to guard against accessing data beyond the page inline_data
+ * points at.
+ */
+static inline bool iomap_inline_data_size_valid(struct iomap *iomap)
+{
+ return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
+}
+
/*
* When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
* and page_done will be called for each page written to. This only applies to
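Before the review comments that follow, it may help to see how a filesystem would consume this interface. The sketch below is purely editorial and not from the thread; all myfs_* names, fields, and helpers are invented, and housekeeping fields such as bdev are omitted:

	/*
	 * Hypothetical ->iomap_begin: report a tail-packed extent as
	 * IOMAP_INLINE and everything before it as a normal mapping.
	 * Everything prefixed myfs_ is made up for illustration.
	 */
	static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
	{
		struct myfs_inode *mi = MYFS_I(inode);

		if (pos >= mi->tail_start) {
			/* the tail lives right after the inode in a metadata block */
			iomap->type = IOMAP_INLINE;
			/* must be page aligned in the file, per the checks above */
			iomap->offset = mi->tail_start;
			iomap->length = i_size_read(inode) - mi->tail_start;
			iomap->inline_data = myfs_tail_addr(mi);
			return 0;
		}

		/* leading, block-aligned part of the file */
		iomap->type = IOMAP_MAPPED;
		iomap->offset = pos;
		iomap->length = mi->tail_start - pos;
		iomap->addr = myfs_block_addr(mi, pos);
		return 0;
	}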
> Subject: iomap: Support tail packing

I can't say I like this "tail packing" language here when we have the
perfectly fine inline wording.  Same for various comments in the actual
code.

> +	/* inline and tail-packed data must start page aligned in the file */
> +	if (WARN_ON_ONCE(offset_in_page(iomap->offset)))
> +		return -EIO;
> +	if (WARN_ON_ONCE(size > PAGE_SIZE - offset_in_page(iomap->inline_data)))
> +		return -EIO;

Why can't we use iomap_inline_data_size_valid here?  That is, how can
size be different from iomap->length?

Shouldn't the offset_in_page also go into iomap_inline_data_size_valid,
which should probably be called iomap_inline_data_valid then?

> 	if (iomap->type == IOMAP_INLINE) {
> +		int ret = iomap_read_inline_data(inode, page, iomap);
> +		return ret ?: PAGE_SIZE;

The ?: expression without the first leg is really confusing, especially
when a good old if is much more readable here:

	int ret = iomap_read_inline_data(inode, page, iomap);

	if (ret)
		return ret;
	return PAGE_SIZE;

> +		copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);

> +		copied = copy_to_iter(iomap_inline_data(iomap, pos), length, iter);

Please avoid the overly long lines.
Am Mo., 26. Juli 2021 um 14:17 Uhr schrieb Christoph Hellwig <hch@lst.de>:
>
> > Subject: iomap: Support tail packing
>
> I can't say I like this "tail packing" language here when we have the
> perfectly fine inline wording.  Same for various comments in the actual
> code.
>
> > +	/* inline and tail-packed data must start page aligned in the file */
> > +	if (WARN_ON_ONCE(offset_in_page(iomap->offset)))
> > +		return -EIO;
> > +	if (WARN_ON_ONCE(size > PAGE_SIZE - offset_in_page(iomap->inline_data)))
> > +		return -EIO;
>
> Why can't we use iomap_inline_data_size_valid here?

We can now. Gao, can you change that?

> That is, how can size be different from iomap->length?

Quoting from my previous reply,

"In the iomap_readpage case (iomap_begin with flags == 0),
iomap->length will be the amount of data up to the end of the inode.
In the iomap_file_buffered_write case (iomap_begin with flags ==
IOMAP_WRITE), iomap->length will be the size of iomap->inline_data.
(For extending writes, we need to write beyond the current end of
inode.) So iomap->length isn't all that useful for
iomap_read_inline_data."

> Shouldn't the offset_in_page also go into iomap_inline_data_size_valid,
> which should probably be called iomap_inline_data_valid then?

Hmm, not sure what you mean: iomap_inline_data_size_valid does take
offset_in_page(iomap->inline_data) into account.

> > 	if (iomap->type == IOMAP_INLINE) {
> > +		int ret = iomap_read_inline_data(inode, page, iomap);
> > +		return ret ?: PAGE_SIZE;
>
> The ?: expression without the first leg is really confusing, especially
> when a good old if is much more readable here:
>
> 	int ret = iomap_read_inline_data(inode, page, iomap);
>
> 	if (ret)
> 		return ret;
> 	return PAGE_SIZE;

I'm sure Gao can change this.

> > +		copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);
>
> > +		copied = copy_to_iter(iomap_inline_data(iomap, pos), length, iter);
>
> Please avoid the overly long lines.

I thought people were okay with 80 character long lines?

Thanks,
Andreas
On Mon, Jul 26, 2021 at 01:06:11PM +0200, Andreas Gruenbacher wrote:
> @@ -671,11 +683,11 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
> 	void *addr;
>
> 	WARN_ON_ONCE(!PageUptodate(page));
> -	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	BUG_ON(!iomap_inline_data_size_valid(iomap));
>
> 	flush_dcache_page(page);
> 	addr = kmap_atomic(page);
> -	memcpy(iomap->inline_data + pos, addr + pos, copied);
> +	memcpy(iomap_inline_data(iomap, pos), addr + pos, copied);
> 	kunmap_atomic(addr);
>
> 	mark_inode_dirty(inode);

Only tangentially related ... why do we memcpy the data into the tail
at write_end() time instead of at writepage() time?  I see there's a
workaround for that in gfs2's page_mkwrite():

	if (gfs2_is_stuffed(ip)) {
		err = gfs2_unstuff_dinode(ip);

(an mmap store cannot change the size of the file, so this would be
unnecessary)

Something like this ...

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..3aeebe899fc5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -665,9 +665,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	return copied;
 }
 
-static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
-		struct iomap *iomap, loff_t pos, size_t copied)
+static int iomap_write_inline_data(struct inode *inode, struct page *page,
+		struct iomap *iomap)
 {
+	size_t size = i_size_read(inode) - page_offset(page);
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
@@ -675,11 +676,10 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
 
 	flush_dcache_page(page);
 	addr = kmap_atomic(page);
-	memcpy(iomap->inline_data + pos, addr + pos, copied);
+	memcpy(iomap->inline_data, addr, size);
 	kunmap_atomic(addr);
 
-	mark_inode_dirty(inode);
-	return copied;
+	return 0;
 }
 
 /* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
@@ -691,9 +691,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	loff_t old_size = inode->i_size;
 	size_t ret;
 
-	if (srcmap->type == IOMAP_INLINE) {
-		ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
-	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
+	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
 		ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
 				page, NULL);
 	} else {
@@ -1314,6 +1312,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
 
+	if (wpc->iomap.type == IOMAP_INLINE)
+		return iomap_write_inline_data(inode, page, iomap);
+
 	/*
 	 * Walk through the page to find areas to write back. If we run off the
 	 * end of the current map or find the current map invalid, grab a new
@@ -1328,8 +1329,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		error = wpc->ops->map_blocks(wpc, inode, file_offset);
 		if (error)
 			break;
-		if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
-			continue;
 		if (wpc->iomap.type == IOMAP_HOLE)
 			continue;
 		iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
Hi Andreas, Christoph,

On Mon, Jul 26, 2021 at 02:27:12PM +0200, Andreas Grünbacher wrote:
> Am Mo., 26. Juli 2021 um 14:17 Uhr schrieb Christoph Hellwig <hch@lst.de>:
> >
> > > Subject: iomap: Support tail packing
> >
> > I can't say I like this "tail packing" language here when we have the
> > perfectly fine inline wording.  Same for various comments in the actual
> > code.
> >
> > > +	/* inline and tail-packed data must start page aligned in the file */
> > > +	if (WARN_ON_ONCE(offset_in_page(iomap->offset)))
> > > +		return -EIO;
> > > +	if (WARN_ON_ONCE(size > PAGE_SIZE - offset_in_page(iomap->inline_data)))
> > > +		return -EIO;
> >
> > Why can't we use iomap_inline_data_size_valid here?
>
> We can now. Gao, can you change that?

Thank you all for taking so much time on this! Much appreciated.

I'm fine with updating that.

> > That is, how can size be different from iomap->length?
>
> Quoting from my previous reply,
>
> "In the iomap_readpage case (iomap_begin with flags == 0),
> iomap->length will be the amount of data up to the end of the inode.

For tail-packing cases, iomap->length is just the length of the
tail-packing inline extent.

> In the iomap_file_buffered_write case (iomap_begin with flags ==
> IOMAP_WRITE), iomap->length will be the size of iomap->inline_data.
> (For extending writes, we need to write beyond the current end of
> inode.) So iomap->length isn't all that useful for
> iomap_read_inline_data."

Ok, now it seems I get your point. For the current gfs2 inline cases:

	iomap_write_begin
	  iomap_write_begin_inline
	    iomap_read_inline_data

here, gfs2 passes in a buffer with "iomap->length", which could be
larger than i_size_read(inode) for gfs2. Is that correct?

	loff_t max_size = gfs2_max_stuffed_size(ip);

	iomap->length = max_size;

If that is what gfs2 currently does, I think it makes sense to use it
like this temporarily, but IMO iomap->inline_bufsize is not
iomap->length. These are two different concepts.

> > Shouldn't the offset_in_page also go into iomap_inline_data_size_valid,
> > which should probably be called iomap_inline_data_valid then?
>
> Hmm, not sure what you mean: iomap_inline_data_size_valid does take
> offset_in_page(iomap->inline_data) into account.
>
> > > 	if (iomap->type == IOMAP_INLINE) {
> > > +		int ret = iomap_read_inline_data(inode, page, iomap);
> > > +		return ret ?: PAGE_SIZE;
> >
> > The ?: expression without the first leg is really confusing, especially
> > when a good old if is much more readable here:
>
> I'm sure Gao can change this.
>
> > 	int ret = iomap_read_inline_data(inode, page, iomap);
> >
> > 	if (ret)
> > 		return ret;
> > 	return PAGE_SIZE;

I'm fine with updating it if there is no strong opinion.

> > > +		copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);
> > >
> > > +		copied = copy_to_iter(iomap_inline_data(iomap, pos), length, iter);
> >
> > Please avoid the overly long lines.
>
> I thought people were okay with 80 character long lines?

Christoph mentioned this before:
https://lore.kernel.org/linux-fsdevel/YPVe41YqpfGLNsBS@infradead.org/

	We also need to take the offset into account for the write side.
	I guess it would be nice to have a local variable for the inline
	address to not duplicate that calculation multiple times.

Thanks,
Gao Xiang

>
> Thanks,
> Andreas
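To make the two length conventions being debated here concrete, a rough editorial sketch of what each side would store in the iomap, based only on the descriptions quoted above (tail_start is a made-up variable; gfs2_max_stuffed_size is the gfs2 helper mentioned in the mail):

	/* read side: the mapping covers exactly the inline/tail extent */
	iomap->type = IOMAP_INLINE;
	iomap->offset = tail_start;	/* page aligned in the file */
	iomap->length = i_size_read(inode) - tail_start;

	/*
	 * gfs2-style write side: the mapping advertises the whole writable
	 * inline buffer, which may extend past i_size so that extending
	 * writes have room to land.
	 */
	iomap->type = IOMAP_INLINE;
	iomap->offset = 0;
	iomap->length = gfs2_max_stuffed_size(ip);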
On Mon, Jul 26, 2021 at 2:33 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Jul 26, 2021 at 01:06:11PM +0200, Andreas Gruenbacher wrote:
> > @@ -671,11 +683,11 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
> > 	void *addr;
> >
> > 	WARN_ON_ONCE(!PageUptodate(page));
> > -	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
> > +	BUG_ON(!iomap_inline_data_size_valid(iomap));
> >
> > 	flush_dcache_page(page);
> > 	addr = kmap_atomic(page);
> > -	memcpy(iomap->inline_data + pos, addr + pos, copied);
> > +	memcpy(iomap_inline_data(iomap, pos), addr + pos, copied);
> > 	kunmap_atomic(addr);
> >
> > 	mark_inode_dirty(inode);
>
> Only tangentially related ... why do we memcpy the data into the tail
> at write_end() time instead of at writepage() time?  I see there's a
> workaround for that in gfs2's page_mkwrite():
>
> 	if (gfs2_is_stuffed(ip)) {
> 		err = gfs2_unstuff_dinode(ip);
>
> (an mmap store cannot change the size of the file, so this would be
> unnecessary)

Not sure if an additional __set_page_dirty_nobuffers is needed in that
case, but doing the writeback at writepage time should work just as
well. It's just that gfs2 did it at write time historically. The
un-inlining in gfs2_page_mkwrite() could probably also be removed.

I can give this a try, but I'll unfortunately be AFK for the next
couple of days.

> Something like this ...
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 87ccb3438bec..3aeebe899fc5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -665,9 +665,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	return copied;
>  }
>
> -static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
> -		struct iomap *iomap, loff_t pos, size_t copied)
> +static int iomap_write_inline_data(struct inode *inode, struct page *page,
> +		struct iomap *iomap)
>  {
> +	size_t size = i_size_read(inode) - page_offset(page);

You surely mean inode->i_size - iomap->offset.

>  	void *addr;
>
>  	WARN_ON_ONCE(!PageUptodate(page));
> @@ -675,11 +676,10 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
>
>  	flush_dcache_page(page);
>  	addr = kmap_atomic(page);
> -	memcpy(iomap->inline_data + pos, addr + pos, copied);
> +	memcpy(iomap->inline_data, addr, size);
>  	kunmap_atomic(addr);
>
> -	mark_inode_dirty(inode);
> -	return copied;
> +	return 0;
>  }
>
>  /* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
> @@ -691,9 +691,7 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	loff_t old_size = inode->i_size;
>  	size_t ret;
>
> -	if (srcmap->type == IOMAP_INLINE) {
> -		ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> +	if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
>  		ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
>  				page, NULL);
>  	} else {
> @@ -1314,6 +1312,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>
>  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
> +	if (wpc->iomap.type == IOMAP_INLINE)
> +		return iomap_write_inline_data(inode, page, iomap);
> +
>  	/*
>  	 * Walk through the page to find areas to write back. If we run off the
>  	 * end of the current map or find the current map invalid, grab a new
> @@ -1328,8 +1329,6 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		error = wpc->ops->map_blocks(wpc, inode, file_offset);
>  		if (error)
>  			break;
> -		if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
> -			continue;
>  		if (wpc->iomap.type == IOMAP_HOLE)
>  			continue;
>  		iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
>

Thanks,
Andreas
On Mon, Jul 26, 2021 at 2:51 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> Hi Andreas, Christoph,
>
> On Mon, Jul 26, 2021 at 02:27:12PM +0200, Andreas Grünbacher wrote:

[...]

> > > > +		copied = copy_from_iter(iomap_inline_data(iomap, pos), length, iter);
> > > >
> > > > +		copied = copy_to_iter(iomap_inline_data(iomap, pos), length, iter);
> > >
> > > Please avoid the overly long lines.
> >
> > I thought people were okay with 80 character long lines?
>
> Christoph mentioned this before:
> https://lore.kernel.org/linux-fsdevel/YPVe41YqpfGLNsBS@infradead.org/
>
> 	We also need to take the offset into account for the write side.
> 	I guess it would be nice to have a local variable for the inline
> 	address to not duplicate that calculation multiple times.

Fair enough, we could add a local variable:

	void *inline_data = iomap_inline_data(iomap, pos);

and use that in the copy_from_iter and copy_to_iter. Why not.

Thanks,
Andreas
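Put together, the refactor agreed on above would make the dio actor read roughly as follows. This is an editorial sketch assembled from the v7 code quoted earlier in the thread plus the iomap_inline_data rename, not a patch anyone posted:

	static loff_t
	iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
			struct iomap_dio *dio, struct iomap *iomap)
	{
		struct iov_iter *iter = dio->submit.iter;
		/* compute the inline address once instead of at each copy */
		void *inline_data = iomap_inline_data(iomap, pos);
		size_t copied;

		if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
			return -EIO;

		if (dio->flags & IOMAP_DIO_WRITE) {
			loff_t size = inode->i_size;

			/* zero any gap between EOF and the write position */
			if (pos > size)
				memset(iomap_inline_data(iomap, size), 0, pos - size);
			copied = copy_from_iter(inline_data, length, iter);
			if (copied) {
				if (pos + copied > size)
					i_size_write(inode, pos + copied);
				mark_inode_dirty(inode);
			}
		} else {
			copied = copy_to_iter(inline_data, length, iter);
		}
		dio->size += copied;
		return copied;
	}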
On Mon, Jul 26, 2021 at 03:03:14PM +0200, Andreas Gruenbacher wrote:
> On Mon, Jul 26, 2021 at 2:33 PM Matthew Wilcox <willy@infradead.org> wrote:

[...]

> > Only tangentially related ... why do we memcpy the data into the tail
> > at write_end() time instead of at writepage() time?  I see there's a
> > workaround for that in gfs2's page_mkwrite():
> >
> > 	if (gfs2_is_stuffed(ip)) {
> > 		err = gfs2_unstuff_dinode(ip);
> >
> > (an mmap store cannot change the size of the file, so this would be
> > unnecessary)
>
> Not sure if an additional __set_page_dirty_nobuffers is needed in that
> case, but doing the writeback at writepage time should work just as
> well. It's just that gfs2 did it at write time historically. The
> un-inlining in gfs2_page_mkwrite() could probably also be removed.
>
> I can give this a try, but I'll unfortunately be AFK for the next
> couple of days.

I tend to leave that as another story that can be resolved with a
follow-up patch (otherwise I will get stuck in this; I need to do my
own development work instead of spinning on this iomap patch, since I
can see it already works well for gfs2 and erofs). I will update the
patch Andreas posted with Christoph's comments.

Thanks,
Gao Xiang
On Mon, Jul 26, 2021 at 02:27:12PM +0200, Andreas Grünbacher wrote:
> > That is, how can size be different from iomap->length?
>
> Quoting from my previous reply,
>
> "In the iomap_readpage case (iomap_begin with flags == 0),
> iomap->length will be the amount of data up to the end of the inode.
> In the iomap_file_buffered_write case (iomap_begin with flags ==
> IOMAP_WRITE), iomap->length will be the size of iomap->inline_data.
> (For extending writes, we need to write beyond the current end of
> inode.) So iomap->length isn't all that useful for
> iomap_read_inline_data."

I think we should fix that now that we have the srcmap concept.  That
is, for IOMAP_WRITE|IOMAP_ZERO return the inline map as the source map,
and return the actual block map we plan to write into as the main iomap.

> > Shouldn't the offset_in_page also go into iomap_inline_data_size_valid,
> > which should probably be called iomap_inline_data_valid then?
>
> Hmm, not sure what you mean: iomap_inline_data_size_valid does take
> offset_in_page(iomap->inline_data) into account.

Indeed, sorry for the braino.

> I thought people were okay with 80 character long lines?

No.
On Mon, Jul 26, 2021 at 09:12:44PM +0800, Gao Xiang wrote:
> I tend to leave that as another story that can be resolved with a
> follow-up patch (otherwise I will get stuck in this; I need to do my
> own development work instead of spinning on this iomap patch, since I
> can see it already works well for gfs2 and erofs). I will update the
> patch Andreas posted with Christoph's comments.

Yes, let's do the minimal work for erofs needs in a first step.  After
that I hope I can get the iomap_iter change in, and then we can start
the next projects.  That being said, moving the copy back into the
inline data to writepage time does seem very useful, as does the
"small" blocksize support from Matthew.
On Mon, Jul 26, 2021 at 09:22:41AM +0200, Andreas Gruenbacher wrote:
> On Mon, Jul 26, 2021 at 4:36 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > On Mon, Jul 26, 2021 at 12:16:39AM +0200, Andreas Gruenbacher wrote:
> > > Here's a fixed and cleaned up version that passes fstests on gfs2.
> > >
> > > I see no reason why the combination of tail packing + writing should
> > > cause any issues, so in my opinion, the check that disables that
> > > combination in iomap_write_begin_inline should still be removed.
> >
> > Since there is no such fs for tail-packing write, I just do a wild
> > guess, for example,
> >  1) the tail-end block was not inlined, so iomap_write_end() dirtied
> >     the whole page (or buffer) for the page writeback;
> >  2) then it was truncated into a tail-packing inline block so the last
> >     extent(page) became INLINE but dirty instead;
> >  3) during the late page writeback for dirty pages,
> >     if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
> >     would be triggered in iomap_writepage_map() for such dirty page.
> >
> > As Matthew pointed out before,
> > https://lore.kernel.org/r/YPrms0fWPwEZGNAL@casper.infradead.org/
> > currently tail-packing inline won't interact with page writeback, but
> > I'm afraid a supported tail-packing write fs needs to reconsider the
> > whole stuff how page, inode writeback works and what the pattern is
> > with the tail-packing.

[...]

> > > -static void
> > > -iomap_read_inline_data(struct inode *inode, struct page *page,
> > > +static int iomap_read_inline_data(struct inode *inode, struct page *page,
> > >  		struct iomap *iomap)
> > >  {
> > > -	size_t size = i_size_read(inode);
> > > +	size_t size = i_size_read(inode) - iomap->offset;
> >
> > I wonder why you use i_size / iomap->offset here,
>
> This function is supposed to copy the inline or tail data at
> iomap->inline_data into the page passed to it. Logically, the inline
> data starts at iomap->offset and extends until i_size_read(inode).
> Relative to the page, the inline data starts at offset 0 and extends
> until i_size_read(inode) - iomap->offset. It's as simple as that.

It's only as simple as that because the inline data read code is overfit
to the single use case (gfs2) that it supports.  So far in its history,
iomap has never had to support inline data regions that do not coincide
or overlap with EOF, nor has it had to support regions that do not start
at pos==0.  That is why it is appropriate to use the memcpy -> memset ->
return PAGE_SIZE pattern and short-circuit what we do everywhere else in
iomap.

For a non-inline readpage call, filesystems are allowed to return
mappings for blocks beyond EOF.  The call to iomap_adjust_read_range
sets us up to read data from disk through the EOF block, and for the
remainder of the page we zero the post-eof blocks within that page.

IOWs, for reads, __gfs2_iomap_get probably ought to set iomap->length to
gfs2_max_stuffed_size() like it does for writes, and we ought to
generalize iomap_read_inline_data to stop copying after
min(iomap->length, i_size_read() - iomap->offset) bytes.  If it then
discovers that it has indeed reached EOF, then we can zero the rest of
the page and add that quantity to the number of bytes read.

Right now for gfs2 the two arguments to min are always the same so the
function omits all the bits that would make the zeroing actually
conditional on whether we really hit EOF, and pass any copied size other
than PAGE_SIZE back to iomap_readpage_actor.  Given that we still don't
have any filesystems that require us to support inline regions entirely
below EOF I'm fine with omitting the general (and hence untestable)
solution... for now.

(I now think I understand why someone brought up inline data regions in
the middle of files last week.)

--D

> > and why you completely ignoring iomap->length field returning by fs.
>
> In the iomap_readpage case (iomap_begin with flags == 0),
> iomap->length will be the amount of data up to the end of the inode.
> In the iomap_file_buffered_write case (iomap_begin with flags ==
> IOMAP_WRITE), iomap->length will be the size of iomap->inline_data.
> (For extending writes, we need to write beyond the current end of
> inode.) So iomap->length isn't all that useful for
> iomap_read_inline_data.
>
> > Using i_size here instead of iomap->length seems coupling to me in the
> > beginning (even currently in practice there is some limitation.)
>
> And what is that?
>
> Thanks,
> Andreas
Am Mo., 26. Juli 2021 um 23:36 Uhr schrieb Darrick J. Wong <djwong@kernel.org>:
> On Mon, Jul 26, 2021 at 09:22:41AM +0200, Andreas Gruenbacher wrote:

[...]

> > This function is supposed to copy the inline or tail data at
> > iomap->inline_data into the page passed to it. Logically, the inline
> > data starts at iomap->offset and extends until i_size_read(inode).
> > Relative to the page, the inline data starts at offset 0 and extends
> > until i_size_read(inode) - iomap->offset. It's as simple as that.
>
> It's only as simple as that because the inline data read code is overfit
> to the single use case (gfs2) that it supports.  So far in its history,
> iomap has never had to support inline data regions that do not coincide
> or overlap with EOF, nor has it had to support regions that do not start
> at pos==0.  That is why it is appropriate to use the memcpy -> memset ->
> return PAGE_SIZE pattern and short-circuit what we do everywhere else in
> iomap.
>
> For a non-inline readpage call, filesystems are allowed to return
> mappings for blocks beyond EOF.  The call to iomap_adjust_read_range
> sets us up to read data from disk through the EOF block, and for the
> remainder of the page we zero the post-eof blocks within that page.
>
> IOWs, for reads, __gfs2_iomap_get probably ought to set iomap->length to
> gfs2_max_stuffed_size() like it does for writes, and we ought to
> generalize iomap_read_inline_data to stop copying after
> min(iomap->length, i_size_read() - iomap->offset) bytes.  If it then
> discovers that it has indeed reached EOF, then we can zero the rest of
> the page and add that quantity to the number of bytes read.

That sounds like a useful improvement. I'll give it a try.

Thanks,
Andreas

> Right now for gfs2 the two arguments to min are always the same so the
> function omits all the bits that would make the zeroing actually
> conditional on whether we really hit EOF, and pass any copied size other
> than PAGE_SIZE back to iomap_readpage_actor.  Given that we still don't
> have any filesystems that require us to support inline regions entirely
> below EOF I'm fine with omitting the general (and hence untestable)
> solution... for now.
>
> (I now think I understand why someone brought up inline data regions in
> the middle of files last week.)
>
> --D
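For reference, the generalisation Darrick describes might look roughly like this. This is an editorial sketch, untested, keeping the function signature from the v2 patch earlier in the thread; as he notes, the unconditional zeroing is only correct while the inline region always ends at EOF:

	static int iomap_read_inline_data(struct inode *inode, struct page *page,
			struct iomap *iomap)
	{
		/* stop copying at the end of the extent or at EOF,
		 * whichever comes first */
		size_t size = min_t(loff_t, iomap->length,
				    i_size_read(inode) - iomap->offset);
		void *addr;

		if (PageUptodate(page))
			return 0;
		if (WARN_ON_ONCE(size > PAGE_SIZE - offset_in_page(iomap->inline_data)))
			return -EIO;

		addr = kmap_atomic(page);
		memcpy(addr, iomap->inline_data, size);
		/* unconditional zeroing assumes the region reaches EOF; a
		 * fully general version would zero only when EOF was hit */
		memset(addr + size, 0, PAGE_SIZE - size);
		kunmap_atomic(addr);
		SetPageUptodate(page);
		return 0;
	}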
On Mon, Jul 26, 2021 at 02:17:02PM +0200, Christoph Hellwig wrote:
> > Subject: iomap: Support tail packing
>
> I can't say I like this "tail packing" language here when we have the
> perfectly fine inline wording.  Same for various comments in the actual
> code.

Yes please, don't call it tail-packing when it's an inline extent; we'll
use that term for btrfs eventually, and conflating the two has been
confusing users. Except for reiserfs, no Linux filesystem does
tail-packing.
On Tue, Jul 27, 2021 at 10:20:42AM +0200, David Sterba wrote:
> On Mon, Jul 26, 2021 at 02:17:02PM +0200, Christoph Hellwig wrote:
> > > Subject: iomap: Support tail packing
> >
> > I can't say I like this "tail packing" language here when we have the
> > perfectly fine inline wording.  Same for various comments in the actual
> > code.
>
> Yes please, don't call it tail-packing when it's an inline extent; we'll
> use that term for btrfs eventually, and conflating the two has been
> confusing users. Except for reiserfs, no Linux filesystem does
> tail-packing.

Hmm ... I see what reiserfs does as packing tails of multiple files into
one block.  What gfs2 (and ext4) do is inline data.  Erofs packs the
tail of a single file into the same block as the inode.  If I understand
what btrfs does correctly, it stores data in the btree.  But (like
gfs2/ext4), it's only for the entire-file-is-small case, not for the
its-just-ten-bytes-into-the-last-block case.

So what would you call what erofs is doing if not tail-packing?
Wikipedia calls it https://en.wikipedia.org/wiki/Block_suballocation
which doesn't quite fit.  We need a phrase which means "this isn't
just for small files but for small tails of large files".
On Tue, Jul 27, 2021 at 02:35:46PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 27, 2021 at 10:20:42AM +0200, David Sterba wrote:

[...]

> Hmm ... I see what reiserfs does as packing tails of multiple files into
> one block.  What gfs2 (and ext4) do is inline data.  Erofs packs the
> tail of a single file into the same block as the inode.  If I understand

Plus, each erofs meta block can hold multiple inodes (and thus multiple
tail blocks), as long as they fit in the meta block itself.

No matter what it's called, it's a kind of inline data (I think "inline"
means that data mixes with metadata, according to [1]). I initially
called it tail-block inline... whatever. Hopefully, Darrick can update
the v8 title if there is a concern here.

[1] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

Thanks,
Gao Xiang

> what btrfs does correctly, it stores data in the btree.  But (like
> gfs2/ext4), it's only for the entire-file-is-small case, not for the
> its-just-ten-bytes-into-the-last-block case.
>
> So what would you call what erofs is doing if not tail-packing?
> Wikipedia calls it https://en.wikipedia.org/wiki/Block_suballocation
> which doesn't quite fit.  We need a phrase which means "this isn't
> just for small files but for small tails of large files".
On Tue, Jul 27, 2021 at 02:35:46PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 27, 2021 at 10:20:42AM +0200, David Sterba wrote:

[...]

> Hmm ... I see what reiserfs does as packing tails of multiple files into
> one block.  What gfs2 (and ext4) do is inline data.  Erofs packs the
> tail of a single file into the same block as the inode.  If I understand
> what btrfs does correctly, it stores data in the btree.  But (like
> gfs2/ext4), it's only for the entire-file-is-small case, not for the
> its-just-ten-bytes-into-the-last-block case.
>
> So what would you call what erofs is doing if not tail-packing?

That does indeed sound like tail-packing, and I was not aware of it; the
docs I found were not clear about what exactly goes on with the data
stored inline.

> Wikipedia calls it https://en.wikipedia.org/wiki/Block_suballocation
> which doesn't quite fit.  We need a phrase which means "this isn't
> just for small files but for small tails of large files".

So that's more generic than what we now have as inline files: in the
interface, everybody else sets 0 as the start of the range, while erofs
can also set the start of the last partial block.
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..f351e1f9e3f6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -205,25 +205,29 @@ struct iomap_readpage_ctx {
 	struct readahead_control *rac;
 };
 
-static void
-iomap_read_inline_data(struct inode *inode, struct page *page,
-		struct iomap *iomap)
+static int iomap_read_inline_data(struct inode *inode, struct page *page,
+		struct iomap *iomap, loff_t pos)
 {
-	size_t size = i_size_read(inode);
+	size_t size = iomap->length + iomap->offset - pos;
 	void *addr;
 
 	if (PageUptodate(page))
-		return;
+		return PAGE_SIZE;
 
-	BUG_ON(page_has_private(page));
-	BUG_ON(page->index);
-	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	/* inline data must start page aligned in the file */
+	if (WARN_ON_ONCE(offset_in_page(pos)))
+		return -EIO;
+	if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
+		return -EIO;
+	if (WARN_ON_ONCE(page_has_private(page)))
+		return -EIO;
 
 	addr = kmap_atomic(page);
-	memcpy(addr, iomap->inline_data, size);
+	memcpy(addr, iomap_inline_buf(iomap, pos), size);
 	memset(addr + size, 0, PAGE_SIZE - size);
 	kunmap_atomic(addr);
 	SetPageUptodate(page);
+	return PAGE_SIZE;
 }
 
 static inline bool iomap_block_needs_zeroing(struct inode *inode,
@@ -246,11 +250,8 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	unsigned poff, plen;
 	sector_t sector;
 
-	if (iomap->type == IOMAP_INLINE) {
-		WARN_ON_ONCE(pos);
-		iomap_read_inline_data(inode, page, iomap);
-		return PAGE_SIZE;
-	}
+	if (iomap->type == IOMAP_INLINE)
+		return iomap_read_inline_data(inode, page, iomap, pos);
 
 	/* zero post-eof blocks as the page may be mapped */
 	iop = iomap_page_create(inode, page);
@@ -589,6 +590,15 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	return 0;
 }
 
+static int iomap_write_begin_inline(struct inode *inode,
+		struct page *page, struct iomap *srcmap)
+{
+	/* needs more work for the tailpacking case, disable for now */
+	if (WARN_ON_ONCE(srcmap->offset != 0))
+		return -EIO;
+	return iomap_read_inline_data(inode, page, srcmap, 0);
+}
+
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
@@ -618,14 +628,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	}
 
 	if (srcmap->type == IOMAP_INLINE)
-		iomap_read_inline_data(inode, page, srcmap);
+		status = iomap_write_begin_inline(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
 	else
 		status = __iomap_write_begin(inode, pos, len, flags, page,
 				srcmap);
 
-	if (unlikely(status))
+	if (unlikely(status < 0))
 		goto out_unlock;
 
 	*pagep = page;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..a6aaea2764a5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -378,23 +378,25 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap_dio *dio, struct iomap *iomap)
 {
 	struct iov_iter *iter = dio->submit.iter;
+	void *dst = iomap_inline_buf(iomap, pos);
 	size_t copied;
 
-	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	if (WARN_ON_ONCE(!iomap_inline_data_size_valid(iomap)))
+		return -EIO;
 
 	if (dio->flags & IOMAP_DIO_WRITE) {
 		loff_t size = inode->i_size;
 
 		if (pos > size)
-			memset(iomap->inline_data + size, 0, pos - size);
-		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
+			memset(iomap_inline_buf(iomap, size), 0, pos - size);
+		copied = copy_from_iter(dst, length, iter);
 		if (copied) {
 			if (pos + copied > size)
 				i_size_write(inode, pos + copied);
 			mark_inode_dirty(inode);
 		}
 	} else {
-		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
+		copied = copy_to_iter(dst, length, iter);
 	}
 	dio->size += copied;
 	return copied;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..56b118c6d05c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -97,6 +97,20 @@ iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
+static inline void *iomap_inline_buf(const struct iomap *iomap, loff_t pos)
+{
+	return iomap->inline_data - iomap->offset + pos;
+}
+
+/*
+ * iomap->inline_data is a potentially kmapped page, ensure it never crosses a
+ * page boundary.
+ */
+static inline bool iomap_inline_data_size_valid(const struct iomap *iomap)
+{
+	return iomap->length <= PAGE_SIZE - offset_in_page(iomap->inline_data);
+}
+
 /*
  * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
  * and page_done will be called for each page written to. This only applies to
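As a closing illustration of the page-boundary rule that iomap_inline_data_size_valid enforces, here is a tiny stand-alone userspace check. This is editorial only; PAGE_SIZE and offset_in_page are re-created locally and are not the kernel definitions:

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define PAGE_SIZE 4096UL
	#define offset_in_page(p) ((unsigned long)(p) & (PAGE_SIZE - 1))

	/* mirrors the kernel helper: the mapping must not cross a page */
	static bool inline_size_valid(const void *inline_data, unsigned long length)
	{
		return length <= PAGE_SIZE - offset_in_page(inline_data);
	}

	int main(void)
	{
		char *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
		char *data = page + 128;	/* inline data 128 bytes into the page */

		/* 4096 - 128 = 3968 bytes fit; one more would cross the boundary */
		printf("%d %d\n", inline_size_valid(data, 3968),
				  inline_size_valid(data, 3969));
		free(page);
		return 0;
	}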