Message ID | 20221028131109.977581-1-bfoster@redhat.com (mailing list archive)
---|---
State | Deferred, archived
Series | xfs: redirty eof folio on truncate to avoid filemap flush
On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> Here's a quick prototype of "option 3" described in my previous mail.
> This has been spot tested and confirmed to prevent the original stale
> data exposure problem. More thorough regression testing is still
> required. Barring unforeseen issues with that, however, I think this is
> tentatively my new preferred option. The primary reason for that is it
> avoids looking at extent state and is more in line with what iomap based
> zeroing should be doing more generically.
>
> Because of that, I think this provides a bit more opportunity for follow
> on fixes (there are other truncate/zeroing problems I've come across
> during this investigation that still need fixing), cleanup and
> consolidation of the zeroing code. For example, I think the trajectory
> of this could look something like:
>
> - Genericize a bit more to handle all truncates.
> - Repurpose iomap_truncate_page() (currently only used by XFS) into a
>   unique implementation from zero range that does explicit zeroing
>   instead of relying on pagecache truncate.
> - Refactor XFS ranged zeroing to an abstraction that uses a combination
>   of iomap_zero_range() and the new iomap_truncate_page().
>

After playing with this and thinking a bit more about the above, I think
I managed to come up with an iomap_truncate_page() prototype that DTRT
based on this. Only spot tested so far, needs to pass iomap_flags to the
other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
other bugs/warts, etc. etc. This is just a quick prototype to
demonstrate the idea, which is essentially to check dirty state along
with extent state while under lock and transfer that state back to iomap
so it can decide whether it can shortcut or forcibly perform the zero.

In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
under lock and implies that the range is sub-block (single page).
IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
dirty, so perform the zero via buffered write regardless of extent
state.

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..14a9734b2838 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed? we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
+	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
 		return length;
 
 	do {
@@ -916,6 +917,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		trace_printk("%d: ino 0x%lx offset 0x%lx bytes 0x%lx\n",
+			     __LINE__, folio->mapping->host->i_ino, offset, bytes);
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -933,6 +936,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	return written;
 }
 
+static int
+__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
+		const struct iomap_ops *ops)
+{
+	int ret;
+
+	while ((ret = iomap_iter(iter, ops)) > 0)
+		iter->processed = iomap_zero_iter(iter, did_zero);
+	return ret;
+}
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
@@ -943,11 +957,8 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.len		= len,
 		.flags		= IOMAP_ZERO,
 	};
-	int ret;
 
-	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
-	return ret;
+	return __iomap_zero_range(&iter, did_zero, ops);
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
@@ -957,11 +968,17 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 {
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
+	struct iomap_iter iter = {
+		.inode		= inode,
+		.pos		= pos,
+		.len		= blocksize - off,
+		.flags		= IOMAP_ZERO | IOMAP_TRUNC_PAGE,
+	};
 
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+	return __iomap_zero_range(&iter, did_zero, ops);
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..16d9b838e82d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -915,6 +915,7 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
+	u16			iomap_flags = 0;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -942,6 +943,10 @@ xfs_buffered_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if ((flags & IOMAP_TRUNC_PAGE) &&
+	    filemap_range_needs_writeback(VFS_I(ip)->i_mapping, offset, offset))
+		iomap_flags |= IOMAP_F_TRUNC_PAGE;
+
 	/*
 	 * Search the data fork first to look up our source mapping. We
 	 * always need the data fork map, as we have to return it to the
@@ -1100,7 +1105,7 @@ xfs_buffered_write_iomap_begin(
 
 found_imap:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags);
 
 found_cow:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..3c40a81d6da0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -840,16 +840,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..6f9b4f991720 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -56,6 +56,7 @@ struct vm_fault;
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
 #define IOMAP_F_ZONE_APPEND	0x20
+#define IOMAP_F_TRUNC_PAGE	0x40
 
 /*
  * Flags set by the core iomap code during operations:
@@ -146,6 +147,7 @@ struct iomap_page_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_TRUNC_PAGE	(1 << 9)
 
 struct iomap_ops {
 	/*
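(To make the sub-block range math in iomap_truncate_page() above concrete,
here is a worked example with illustrative numbers only; it is not part of
the patch. For a truncate to a position part way into a block, the range
asked of iomap runs from the new EOF to the end of that block:)

	#include <stdio.h>

	/* Worked example of the iomap_truncate_page() range math,
	 * assuming a 4096-byte block size (illustrative only). */
	int main(void)
	{
		unsigned int blocksize = 4096;
		long long pos = 0x1234;			/* new EOF from the truncate */
		unsigned int off = pos & (blocksize - 1);	/* 0x234: offset into the EOF block */

		if (!off)	/* block-aligned EOF: nothing to zero */
			return 0;
		printf("zero [0x%llx, 0x%llx): 0x%x bytes to the end of the block\n",
		       pos, pos + (blocksize - off), blocksize - off);
		return 0;
	}

(The zeroed range is always smaller than one block, which is why the
prototype can treat IOMAP_TRUNC_PAGE as implying a sub-block, single-page
range.)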
On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
[...]
> After playing with this and thinking a bit more about the above, I think
> I managed to come up with an iomap_truncate_page() prototype that DTRT
> based on this. Only spot tested so far, needs to pass iomap_flags to the
> other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
> other bugs/warts, etc. etc. This is just a quick prototype to
> demonstrate the idea, which is essentially to check dirty state along
> with extent state while under lock and transfer that state back to iomap
> so it can decide whether it can shortcut or forcibly perform the zero.
>
> In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
> under lock and implies that the range is sub-block (single page).
> IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
> dirty, so perform the zero via buffered write regardless of extent
> state.

I'd much prefer we fix this in the iomap infrastructure - failing to
zero dirty data in memory over an unwritten extent isn't an XFS bug,
so we shouldn't be working around it in XFS like we did previously.

I don't think this should be called "IOMAP_TRUNC_PAGE", though,
because that indicates the caller context, not what we are asking
the internal iomap code to do. What we are really asking
iomap_zero_iter() to do is zero the page cache if it exists in
memory, otherwise ignore unwritten/hole pages. Hence I think a name
like IOMAP_ZERO_PAGECACHE is more appropriate.

[...]

> @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed? we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
>  		return length;

Why even involve the filesystem in this? We can do this directly
in iomap_zero_iter() with:

	if (srcmap->type == IOMAP_HOLE)
		return;
	if (srcmap->type == IOMAP_UNWRITTEN) {
		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
			return;
		if (!filemap_range_needs_writeback(inode->i_mapping,
				iomap->offset, iomap->offset + iomap->length))
			return;
	}

It probably also warrants a comment that a clean folio over EOF on an
unwritten extent already contains zeros, so we're only interested in
folios that *have been dirtied* over this extent. If it's under
writeback, we should still be zeroing because it will shortly
contain real data on disk and so it needs to be zeroed and
redirtied....

> +static int
> +__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
> +		const struct iomap_ops *ops)
> +{
> +	int ret;
> +
> +	while ((ret = iomap_iter(iter, ops)) > 0)
> +		iter->processed = iomap_zero_iter(iter, did_zero);
> +	return ret;
> +}

I'd just leave this simple loop open coded in the two callers.

[...]

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 07da03976ec1..16d9b838e82d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -942,6 +943,10 @@ xfs_buffered_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +	if ((flags & IOMAP_TRUNC_PAGE) &&
> +	    filemap_range_needs_writeback(VFS_I(ip)->i_mapping, offset, offset))
> +		iomap_flags |= IOMAP_F_TRUNC_PAGE;

As per above, I don't think we should be putting this check in the
filesystem. That simplifies this a lot as filesystems don't need to
know anything about how iomap manages the page cache for the
filesystem...

Cheers,

Dave.
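(Putting Dave's sketch and the comment he asks for together, the check
might end up looking like this in iomap_zero_iter(). This is an editorial
sketch assembled from his reply, adjusted to return `length` the way the
surrounding function does; IOMAP_ZERO_PAGECACHE is only the flag name
proposed in this thread, not an existing iomap flag:)

	const struct iomap *srcmap = iomap_iter_srcmap(iter);
	loff_t length = iomap_length(iter);

	if (srcmap->type == IOMAP_HOLE)
		return length;
	if (srcmap->type == IOMAP_UNWRITTEN) {
		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
			return length;
		/*
		 * A clean folio over EOF on an unwritten extent already
		 * contains zeros, so we are only interested in folios that
		 * have been dirtied over this extent. A folio still under
		 * writeback will shortly contain real data on disk, so it
		 * must be zeroed and redirtied rather than skipped.
		 */
		if (!filemap_range_needs_writeback(iter->inode->i_mapping,
				iter->pos, iter->pos + length - 1))
			return length;
	}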
Hi Brian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on linus/master v6.1-rc2 next-20221028]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Brian-Foster/xfs-redirty-eof-folio-on-truncate-to-avoid-filemap-flush/20221028-211253
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/20221028131109.977581-1-bfoster%40redhat.com
patch subject: [PATCH] xfs: redirty eof folio on truncate to avoid filemap flush
config: x86_64-randconfig-a014
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/b16e2c202472f3fc46e66c37fdc7c381b968af75
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Brian-Foster/xfs-redirty-eof-folio-on-truncate-to-avoid-filemap-flush/20221028-211253
        git checkout b16e2c202472f3fc46e66c37fdc7c381b968af75
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash fs/xfs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/xfs/xfs_iops.c:863:8: warning: variable 'eof_dirty' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
                   if (folio_test_dirty(eof_folio) ||
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/xfs/xfs_iops.c:867:9: note: uninitialized use occurs here
                   if (!eof_dirty) {
                       ^~~~~~~~~
   fs/xfs/xfs_iops.c:863:4: note: remove the 'if' if its condition is always true
                   if (folio_test_dirty(eof_folio) ||
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/xfs/xfs_iops.c:793:18: note: initialize the variable 'eof_dirty' to silence this warning
                   bool eof_dirty;
                                 ^
                                  = 0
   1 warning generated.


vim +863 fs/xfs/xfs_iops.c

   772	
   773	/*
   774	 * Truncate file.  Must have write permission and not be a directory.
   775	 *
   776	 * Caution: The caller of this function is responsible for calling
   777	 * setattr_prepare() or otherwise verifying the change is fine.
   778	 */
   779	STATIC int
   780	xfs_setattr_size(
   781		struct user_namespace	*mnt_userns,
   782		struct xfs_inode	*ip,
   783		struct iattr		*iattr)
   784	{
   785		struct xfs_mount	*mp = ip->i_mount;
   786		struct inode		*inode = VFS_I(ip);
   787		struct folio		*eof_folio = NULL;
   788		xfs_off_t		oldsize, newsize;
   789		struct xfs_trans	*tp;
   790		int			error;
   791		uint			lock_flags = 0;
   792		bool			did_zeroing = false;
   793		bool			eof_dirty;
   794	
   795		ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
   796		ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
   797		ASSERT(S_ISREG(inode->i_mode));
   798		ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
   799			ATTR_MTIME_SET|ATTR_TIMES_SET)) == 0);
   800	
   801		oldsize = inode->i_size;
   802		newsize = iattr->ia_size;
   803	
   804		/*
   805		 * Short circuit the truncate case for zero length files.
   806		 */
   807		if (newsize == 0 && oldsize == 0 && ip->i_df.if_nextents == 0) {
   808			if (!(iattr->ia_valid & (ATTR_CTIME|ATTR_MTIME)))
   809				return 0;
   810	
   811			/*
   812			 * Use the regular setattr path to update the timestamps.
   813			 */
   814			iattr->ia_valid &= ~ATTR_SIZE;
   815			return xfs_setattr_nonsize(mnt_userns, ip, iattr);
   816		}
   817	
   818		/*
   819		 * Make sure that the dquots are attached to the inode.
   820		 */
   821		error = xfs_qm_dqattach(ip);
   822		if (error)
   823			return error;
   824	
   825		/*
   826		 * Wait for all direct I/O to complete.
   827		 */
   828		inode_dio_wait(inode);
   829	
   830		/*
   831		 * File data changes must be complete before we start the transaction to
   832		 * modify the inode. This needs to be done before joining the inode to
   833		 * the transaction because the inode cannot be unlocked once it is a
   834		 * part of the transaction.
   835		 *
   836		 * Start with zeroing any data beyond EOF that we may expose on file
   837		 * extension, or zeroing out the rest of the block on a downward
   838		 * truncate.
   839		 */
   840		if (newsize > oldsize) {
   841			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
   842			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
   843					&did_zeroing);
   844		} else {
   845			/*
   846			 * iomap won't detect a dirty folio over an unwritten block (or
   847			 * a cow block over a hole) and subsequently skips zeroing the
   848			 * newly post-EOF portion of the folio. Doing a flush here (i.e.
   849			 * as is done for fallocate ZERO_RANGE) updates extent state for
   850			 * iomap, but has too much overhead for the truncate path.
   851			 *
   852			 * Instead, check whether the new EOF is dirty in pagecache. If
   853			 * so, hold a reference across the pagecache truncate and dirty
   854			 * the folio. This ensures that partial folio zeroing from the
   855			 * truncate makes it to disk in the rare event that iomap skips
   856			 * zeroing and writeback happens to complete before the
   857			 * pagecache truncate. Note that this really should be handled
   858			 * properly by iomap zero range.
   859			 */
   860			eof_folio = filemap_lock_folio(inode->i_mapping,
   861					(newsize - 1) >> PAGE_SHIFT);
   862			if (eof_folio) {
 > 863				if (folio_test_dirty(eof_folio) ||
   864				    folio_test_writeback(eof_folio))
   865					eof_dirty = true;
   866				folio_unlock(eof_folio);
   867				if (!eof_dirty) {
   868					folio_put(eof_folio);
   869					eof_folio = NULL;
   870				}
   871			}
   872			error = xfs_truncate_page(ip, newsize, &did_zeroing);
   873		}
   874	
   875		if (error) {
   876			if (eof_folio)
   877				folio_put(eof_folio);
   878			return error;
   879		}
   880	
   881		/*
   882		 * We've already locked out new page faults, so now we can safely remove
   883		 * pages from the page cache knowing they won't get refaulted until we
   884		 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
   885		 * complete. The truncate_setsize() call also cleans partial EOF page
   886		 * PTEs on extending truncates and hence ensures sub-page block size
   887		 * filesystems are correctly handled, too.
   888		 *
   889		 * We have to do all the page cache truncate work outside the
   890		 * transaction context as the "lock" order is page lock->log space
   891		 * reservation as defined by extent allocation in the writeback path.
   892		 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
   893		 * having already truncated the in-memory version of the file (i.e. made
   894		 * user visible changes). There's not much we can do about this, except
   895		 * to hope that the caller sees ENOMEM and retries the truncate
   896		 * operation.
   897		 *
   898		 * And we update in-core i_size and truncate page cache beyond newsize
   899		 * before writeback the [i_disk_size, newsize] range, so we're
   900		 * guaranteed not to write stale data past the new EOF on truncate down.
   901		 */
   902		truncate_setsize(inode, newsize);
   903		if (eof_folio) {
   904			trace_printk("%d: ino 0x%llx newsize 0x%llx folio idx 0x%lx did_zeroing %d\n",
   905				__LINE__, ip->i_ino, newsize, folio_index(eof_folio), did_zeroing);
   906			if (!did_zeroing) {
   907				filemap_dirty_folio(inode->i_mapping, eof_folio);
   908			 	did_zeroing = true;
   909			}
   910			folio_put(eof_folio);
   911		}
   912	
   913		/*
   914		 * We are going to log the inode size change in this transaction so
   915		 * any previous writes that are beyond the on disk EOF and the new
   916		 * EOF that have not been written out need to be written here.  If we
   917		 * do not write the data out, we expose ourselves to the null files
   918		 * problem. Note that this includes any block zeroing we did above;
   919		 * otherwise those blocks may not be zeroed after a crash.
   920		 */
   921		if (did_zeroing ||
   922		    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
   923			error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
   924							ip->i_disk_size, newsize - 1);
   925			if (error)
   926				return error;
   927		}
   928	
   929		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
   930		if (error)
   931			return error;
   932	
   933		lock_flags |= XFS_ILOCK_EXCL;
   934		xfs_ilock(ip, XFS_ILOCK_EXCL);
   935		xfs_trans_ijoin(tp, ip, 0);
   936	
   937		/*
   938		 * Only change the c/mtime if we are changing the size or we are
   939		 * explicitly asked to change it.  This handles the semantic difference
   940		 * between truncate() and ftruncate() as implemented in the VFS.
   941		 *
   942		 * The regular truncate() case without ATTR_CTIME and ATTR_MTIME is a
   943		 * special case where we need to update the times despite not having
   944		 * these flags set.  For all other operations the VFS set these flags
   945		 * explicitly if it wants a timestamp update.
   946		 */
   947		if (newsize != oldsize &&
   948		    !(iattr->ia_valid & (ATTR_CTIME | ATTR_MTIME))) {
   949			iattr->ia_ctime = iattr->ia_mtime =
   950				current_time(inode);
   951			iattr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
   952		}
   953	
   954		/*
   955		 * The first thing we do is set the size to new_size permanently on
   956		 * disk.  This way we don't have to worry about anyone ever being able
   957		 * to look at the data being freed even in the face of a crash.
   958		 * What we're getting around here is the case where we free a block, it
   959		 * is allocated to another file, it is written to, and then we crash.
   960		 * If the new data gets written to the file but the log buffers
   961		 * containing the free and reallocation don't, then we'd end up with
   962		 * garbage in the blocks being freed.  As long as we make the new size
   963		 * permanent before actually freeing any blocks it doesn't matter if
   964		 * they get written to.
   965		 */
   966		ip->i_disk_size = newsize;
   967		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
   968	
   969		if (newsize <= oldsize) {
   970			error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, newsize);
   971			if (error)
   972				goto out_trans_cancel;
   973	
   974			/*
   975			 * Truncated "down", so we're removing references to old data
   976			 * here - if we delay flushing for a long time, we expose
   977			 * ourselves unduly to the notorious NULL files problem.  So,
   978			 * we mark this inode and flush it when the file is closed,
   979			 * and do not wait the usual (long) time for writeout.
   980			 */
   981			xfs_iflags_set(ip, XFS_ITRUNCATED);
   982	
   983			/* A truncate down always removes post-EOF blocks. */
   984			xfs_inode_clear_eofblocks_tag(ip);
   985		}
   986	
   987		ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
   988		setattr_copy(mnt_userns, inode, iattr);
   989		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
   990	
   991		XFS_STATS_INC(mp, xs_ig_attrchg);
   992	
   993		if (xfs_has_wsync(mp))
   994			xfs_trans_set_sync(tp);
   995	
   996		error = xfs_trans_commit(tp);
   997	out_unlock:
   998		if (lock_flags)
   999			xfs_iunlock(ip, lock_flags);
  1000		return error;
  1001	
  1002	out_trans_cancel:
  1003		xfs_trans_cancel(tp);
  1004		goto out_unlock;
  1005	}
  1006	
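(The fix the robot is asking for is the one-line initialization it hints
at: eof_dirty is only assigned inside the `if (eof_folio)` branch when the
folio is dirty or under writeback, so the later `!eof_dirty` test otherwise
reads an uninitialized value. A minimal fix against the patch above:)

	-	bool			eof_dirty;
	+	bool			eof_dirty = false;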
On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
[...]
> I'd much prefer we fix this in the iomap infrastructure - failing to
> zero dirty data in memory over an unwritten extent isn't an XFS bug,
> so we shouldn't be working around it in XFS like we did previously.

Hmm, I think I agree, given that this is really a bug in cache handling.
Or so I gather; reading on...

> I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> because that indicates the caller context, not what we are asking
> the internal iomap code to do. What we are really asking
> iomap_zero_iter() to do is zero the page cache if it exists in
> memory, otherwise ignore unwritten/hole pages. Hence I think a name
> like IOMAP_ZERO_PAGECACHE is more appropriate.

I don't even like ZERO_PAGECACHE -- in my mind that implies that it
unconditionally zeroes any page it finds, whereas we really only want it
to zero dirty cache contents. IOMAP_ZERO_DIRTY_CACHE?

[...]

> It probably also warrants a comment that a clean folio over EOF on an
> unwritten extent already contains zeros, so we're only interested in
> folios that *have been dirtied* over this extent. If it's under
> writeback, we should still be zeroing because it will shortly
> contain real data on disk and so it needs to be zeroed and
> redirtied....

Yes, please. I think we've screwed up eofpage handling enough in the
past to warrant good documentation of these corner cases.

[...]

> As per above, I don't think we should be putting this check in the
> filesystem. That simplifies this a lot as filesystems don't need to
> know anything about how iomap manages the page cache for the
> filesystem...

From the bug description, this appears to me to be a problem with how we
manage the page cache during a truncation when the eofpage is backed by
unwritten extents. As such, I think that this should be a fix within
iomap.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Fri, Oct 28, 2022 at 04:49:37PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
[...]
> > I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> > because that indicates the caller context, not what we are asking
> > the internal iomap code to do. What we are really asking
> > iomap_zero_iter() to do is zero the page cache if it exists in
> > memory, otherwise ignore unwritten/hole pages. Hence I think a name
> > like IOMAP_ZERO_PAGECACHE is more appropriate.
>
> I don't even like ZERO_PAGECACHE -- in my mind that implies that it
> unconditionally zeroes any page it finds, whereas we really only want it
> to zero dirty cache contents. IOMAP_ZERO_DIRTY_CACHE?

Fine by me, the name just needs to describe the action that needs to
be performed....

[...]

> From the bug description, this appears to me to be a problem with how we
> manage the page cache during a truncation when the eofpage is backed by
> unwritten extents.

Right, think of iomap_truncate_page() as having exactly the same
responsibilities as block_truncate_page() has for filesystems using
bufferheads. i.e. both functions need to ensure the disk contents
are correctly zeroed such that the caller can safely call
truncate_setsize() afterwards resulting in both the on-disk state
and in-memory state remaining coherent.

Hence iomap_truncate_page() needs to ensure that we handle dirty
data over unwritten extents correctly.

If we look further, it is obvious that iomap already has this
responsibility: SEEK_HOLE/SEEK_DATA does page cache lookups to find
data over unwritten extents. Hence it makes no sense for one part of
iomap to take responsibility for managing data over unwritten
extents, whilst another part ignores it...

If there was another filesystem using iomap and unwritten extents,
it would have exactly the same issues with iomap_truncate_page().
Hence:

> As such, I think that this should be a fix within iomap.

This.

-Dave.
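(For reference, this is the asymmetry Dave is pointing at: the seek code
already treats an unwritten extent as data only where the page cache
actually holds something. A rough sketch of that SEEK_DATA logic follows,
paraphrased from memory of fs/iomap/seek.c around this time - treat the
names as approximate, not a verbatim quote:)

	/*
	 * Paraphrased sketch of iomap's SEEK_DATA iteration: written
	 * extents are always data, holes never are, and unwritten
	 * extents are data only where the page cache has folios.
	 */
	static loff_t iomap_seek_data_iter(const struct iomap_iter *iter,
			loff_t *found_pos)
	{
		loff_t length = iomap_length(iter);

		switch (iter->iomap.type) {
		case IOMAP_HOLE:
			return length;		/* keep searching */
		case IOMAP_UNWRITTEN:
			/* ask the page cache whether any of this range is data */
			*found_pos = mapping_seek_hole_data(iter->inode->i_mapping,
					iter->pos, iter->pos + length, SEEK_DATA);
			if (*found_pos < 0)	/* nothing cached: not data */
				return length;
			return 0;
		default:			/* written extent: always data */
			*found_pos = iter->pos;
			return 0;
		}
	}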
On Sun, Oct 30, 2022 at 09:01:31AM +1100, Dave Chinner wrote:
> Right, think of iomap_truncate_page() as having exactly the same
> responsibilities as block_truncate_page() has for filesystems using
> bufferheads. i.e. both functions need to ensure the disk contents
> are correctly zeroed such that the caller can safely call
> truncate_setsize() afterwards resulting in both the on-disk state
> and in-memory state remaining coherent.

XFS always had these kinds of hacks even when it was using
block_truncate_page. That being said, I fully agree that handling it
in iomap is the absolutely right thing to do and we should have done
it earlier.
On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> index 91ee0b308e13..14a9734b2838 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed? we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))

As mentioned elsewhere in the thread I think we can just move the
filemap_range_needs_writeback here, which removes the need for
IOMAP_F_TRUNC_PAGE.

> +static int
> +__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
> +		const struct iomap_ops *ops)
> +{
> +	int ret;
> +
> +	while ((ret = iomap_iter(iter, ops)) > 0)
> +		iter->processed = iomap_zero_iter(iter, did_zero);
> +	return ret;
> +}

I'd be tempted to just duplicate this two line loop instead of adding
such a trivial helper, but that's just a matter of taste.
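(Concretely, with the helper dropped as Christoph suggests,
iomap_truncate_page() would carry its own copy of the two-line loop.
An editorial sketch combining both of his comments - so the
IOMAP_TRUNC_PAGE flag is gone as well - not a posted patch:)

	int
	iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
			const struct iomap_ops *ops)
	{
		unsigned int blocksize = i_blocksize(inode);
		unsigned int off = pos & (blocksize - 1);
		struct iomap_iter iter = {
			.inode	= inode,
			.pos	= pos,
			.len	= blocksize - off,
			.flags	= IOMAP_ZERO,
		};
		int ret;

		/* Block boundary? Nothing to do */
		if (!off)
			return 0;

		/* the loop duplicated here instead of a shared helper */
		while ((ret = iomap_iter(&iter, ops)) > 0)
			iter.processed = iomap_zero_iter(&iter, did_zero);
		return ret;
	}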
On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
[...]
> I'd much prefer we fix this in the iomap infrastructure - failing to
> zero dirty data in memory over an unwritten extent isn't an XFS bug,
> so we shouldn't be working around it in XFS like we did previously.
>

I agree, but that was the original goal from the start. It's easier said
than done to just have iomap accurately write/skip the appropriate
ranges...

> I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> because that indicates the caller context, not what we are asking
> the internal iomap code to do. What we are really asking
> iomap_zero_iter() to do is zero the page cache if it exists in
> memory, otherwise ignore unwritten/hole pages. Hence I think a name
> like IOMAP_ZERO_PAGECACHE is more appropriate.
>

That was kind of the point for this prototype. The flag isn't just
asking iomap to perform some generic write behavior. It also indicates
this is a special snowflake mode with assumptions from the caller (i.e.,
unflushed, sub-block/partial range) that facilitate forced zeroing
internally.

I've since come to the conclusion that this approach is just premature.
It really only does the right thing in this very particular case,
otherwise there is potential for odd/unexpected behavior in sub-page
blocksize scenarios. It could be made to work more appropriately
eventually, but more thought and work is required and there's a jump in
complexity that isn't required to fix the immediate performance problem
and additional stale data exposure problems.

[...]

> Why even involve the filesystem in this? We can do this directly
> in iomap_zero_iter() with:
>
> 	if (srcmap->type == IOMAP_HOLE)
> 		return;
> 	if (srcmap->type == IOMAP_UNWRITTEN) {
> 		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
> 			return;
> 		if (!filemap_range_needs_writeback(inode->i_mapping,
> 				iomap->offset, iomap->offset + iomap->length))
> 			return;
> 	}
>

This reintroduces the same stale data exposure race fixed by the
original patch. folio writeback can complete and convert an extent
reported as unwritten such that zeroing will not occur when it should.

The writeback check either needs to be in the fs code where it can
prevent writeback completion from converting extents (under ilock), or
earlier in iomap such that we're guaranteed to see either writeback
state or the converted extent.

Given the above, I'm currently putting the prototype shown below through
some regression testing. It lifts the writeback check into iomap and
issues a flush and retry of the current iteration from iomap_zero_iter()
when necessary. In more focused testing thus far, this addresses the XFS
truncate performance problem by making the flush conditional, prevents
the original stale data exposure problem, and also addresses one or two
similar outstanding problems buried in write extending patterns and
whatnot.

Since this all exists in iomap, no fs changes are required to provide a
minimum guarantee of correctness (at the cost of additional, occasional
flushes). If there are scenarios where the performance cost is a
problem, an additional flag akin to the IOMAP_TRUNC_PAGE proposal can
still be added and conditionally set by the fs if/when it wants to make
that tradeoff. For example, XFS could pass some ->private state through
the iomap_truncate_page() variant that ->iomap_begin() could use to
trivially eliminate the flush in the truncate path, or implement a size
or range trimming heuristic to eliminate/mitigate it in others. If/when
iomap has enough information to do selective zeroing properly on its
own without fs intervention, the flush fallback can simply go away.

I have some patches around that demonstrate such things, but given the
current behavior of an unconditional flush and so far only seeing one
user report from an oddball pattern, I don't see much need for
additional hacks at the moment.

Thoughts? IMO, the iter retry and writeback check are both wonky and
this is something that should first go into ->iomap_begin(), and only
lift into iomap after a period of time. That's just my .02. I do plan to
clean up with a retry helper and add comments and whatnot if this
survives testing and/or any functional issues are worked out.

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..649d94ad3808 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
 static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
+	struct inode *inode = iter->inode;
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
+	int status;
 
 	/* already zeroed? we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-		return length;
+	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
+		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
+			return length;
+
+		status = filemap_write_and_wait_range(inode->i_mapping, pos,
+				pos + length - 1);
+		if (status)
+			return status;
+		/* XXX: hacked up iter retry */
+		iter->iomap.length = 0;
+		return 1;
+	}
 
 	do {
 		struct folio *folio;
-		int status;
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
@@ -916,6 +927,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		trace_printk("%d: zero ino 0x%lx offset 0x%lx bytes 0x%lx\n",
+			     __LINE__, folio->mapping->host->i_ino, offset, bytes);
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index a1c7592d2ade..ebfcc07ace94 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -5,6 +5,7 @@
  */
 #include <linux/fs.h>
 #include <linux/iomap.h>
+#include <linux/pagemap.h>
 #include "trace.h"
 
 static inline int iomap_iter_advance(struct iomap_iter *iter)
@@ -28,15 +29,18 @@ static inline int iomap_iter_advance(struct iomap_iter *iter)
 	return 1;
 }
 
-static inline void iomap_iter_done(struct iomap_iter *iter)
+static inline void iomap_iter_done(struct iomap_iter *iter, u16 iomap_flags)
 {
 	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
 	WARN_ON_ONCE(iter->iomap.length == 0);
 	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
 
+	iter->iomap.flags |= iomap_flags;
 	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
-	if (iter->srcmap.type != IOMAP_HOLE)
+	if (iter->srcmap.type != IOMAP_HOLE) {
+		iter->srcmap.flags |= iomap_flags;
 		trace_iomap_iter_srcmap(iter->inode, &iter->srcmap);
+	}
 }
 
 /**
@@ -57,6 +61,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter)
 int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 {
 	int ret;
+	u16 iomap_flags = 0;
 
 	if (iter->iomap.length && ops->iomap_end) {
 		ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
@@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 	if (ret <= 0)
 		return ret;
 
+	if ((iter->flags & IOMAP_ZERO) &&
+	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
+			iter->pos + iter->len - 1)) {
+		iomap_flags |= IOMAP_F_DIRTY_CACHE;
+	}
+
 	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
 			       &iter->iomap, &iter->srcmap);
 	if (ret < 0)
 		return ret;
-	iomap_iter_done(iter);
+	iomap_iter_done(iter, iomap_flags);
 	return 1;
 }
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..3c40a81d6da0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -840,16 +840,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..bbfbda7bc905 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -56,6 +56,7 @@ struct vm_fault;
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
 #define IOMAP_F_ZONE_APPEND	0x20
+#define IOMAP_F_DIRTY_CACHE	0x40
 
 /*
  * Flags set by the core iomap code during operations:
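(The "XXX: hacked up iter retry" above is the piece Brian says he plans to
wrap in a helper. A hypothetical version of that cleanup could look like
the following - the name and placement are editorial, not from the thread:)

	/*
	 * Hypothetical cleanup of the "hacked up iter retry" above (this
	 * helper does not exist in iomap): clearing iomap.length makes
	 * the next iomap_iter() call skip the advance and invoke
	 * ->iomap_begin() again for the same range, while the positive
	 * return value keeps the iteration loop going.
	 */
	static inline loff_t iomap_iter_retry(struct iomap_iter *iter)
	{
		iter->iomap.length = 0;
		return 1;
	}

(iomap_zero_iter() would then end the unwritten/hole branch with
"return iomap_iter_retry(iter);" after the flush.)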
On Thu, Nov 03, 2022 at 10:53:16AM -0400, Brian Foster wrote: > On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote: > > On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote: > > > On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote: > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > > > > > Here's a quick prototype of "option 3" described in my previous mail. > > > > This has been spot tested and confirmed to prevent the original stale > > > > data exposure problem. More thorough regression testing is still > > > > required. Barring unforeseen issues with that, however, I think this is > > > > tentatively my new preferred option. The primary reason for that is it > > > > avoids looking at extent state and is more in line with what iomap based > > > > zeroing should be doing more generically. > > > > > > > > Because of that, I think this provides a bit more opportunity for follow > > > > on fixes (there are other truncate/zeroing problems I've come across > > > > during this investigation that still need fixing), cleanup and > > > > consolidation of the zeroing code. For example, I think the trajectory > > > > of this could look something like: > > > > > > > > - Genericize a bit more to handle all truncates. > > > > - Repurpose iomap_truncate_page() (currently only used by XFS) into a > > > > unique implementation from zero range that does explicit zeroing > > > > instead of relying on pagecache truncate. > > > > - Refactor XFS ranged zeroing to an abstraction that uses a combination > > > > of iomap_zero_range() and the new iomap_truncate_page(). > > > > > > > > > > After playing with this and thinking a bit more about the above, I think > > > I managed to come up with an iomap_truncate_page() prototype that DTRT > > > based on this. Only spot tested so far, needs to pass iomap_flags to the > > > other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has > > > other bugs/warts, etc. etc. This is just a quick prototype to > > > demonstrate the idea, which is essentially to check dirty state along > > > with extent state while under lock and transfer that state back to iomap > > > so it can decide whether it can shortcut or forcibly perform the zero. > > > > > > In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while > > > under lock and implies that the range is sub-block (single page). > > > IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact > > > dirty, so perform the zero via buffered write regardless of extent > > > state. > > > > I'd much prefer we fix this in the iomap infrastructure - failing to > > zero dirty data in memory over an unwritten extent isn't an XFS bug, > > so we shouldn't be working around it in XFS like we did previously. > > > > I agree, but that was the original goal from the start. It's easier said > than done to just have iomap accurately write/skip the appropriate > ranges.. > > > I don't think this should be call "IOMAP_TRUNC_PAGE", though, > > because that indicates the caller context, not what we are asking > > the internal iomap code to do. What we are really asking is for > > iomap_zero_iter() to do is zero the page cache if it exists in > > memory, otherwise ignore unwritten/hole pages. Hence I think a name > > like IOMAP_ZERO_PAGECACHE is more appropriate, > > > > That was kind of the point for this prototype. The flag isn't just > asking iomap to perform some generic write behavior. 
It also indicates > this is a special snowflake mode with assumptions from the caller (i.e., > unflushed, sub-block/partial range) that facilitate forced zeroing > internally. > > I've since come to the conclusion that this approach is just premature. > It really only does the right thing in this very particular case, > otherwise there is potential for odd/unexpected behavior in sub-page > blocksize scenarios. It could be made to work more appropriately > eventually, but more thought and work is required and there's a jump in > complexity that isn't required to fix the immediate performance problem > and additional stale data exposure problems. > > > > > > > Brian > > > > > > --- 8< --- > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > > index 91ee0b308e13..14a9734b2838 100644 > > > --- a/fs/iomap/buffered-io.c > > > +++ b/fs/iomap/buffered-io.c > > > @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > > > loff_t written = 0; > > > > > > /* already zeroed? we're done. */ > > > - if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) > > > + if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) && > > > + !(srcmap->flags & IOMAP_F_TRUNC_PAGE)) > > > return length; > > > > Why even involve the filesystem in this? We can do this directly > > in iomap_zero_iter() with: > > > > if ((srcmap->type == IOMAP_HOLE) > > return; > > if (srcmap->type == IOMAP_UNWRITTEN) { > > if (!(iter->flags & IOMAP_ZERO_PAGECACHE)) > > return; > > if (!filemap_range_needs_writeback(inode->i_mapping, > > iomap->offset, iomap->offset + iomap->length)) > > return; > > } > > > > This reintroduces the same stale data exposure race fixed by the > original patch. folio writeback can complete and convert an extent > reported as unwritten such that zeroing will not occur when it should. That sounds the exact data corruption bug that is being fixed in this xfs/iomap series here: https://lore.kernel.org/linux-xfs/20221101003412.3842572-1-david@fromorbit.com/ > The writeback check either needs to be in the fs code where it can > prevent writeback completion from converting extents (under ilock), or > earlier in iomap such that we're guaranteed to see either writeback > state or the converted extent. Or we catch stale iomaps once we've locked the page we are going to modify and remap the extent before continuing the modification operation. > Given the above, I'm currently putting the prototype shown below through > some regression testing. It lifts the writeback check into iomap and > issues a flush and retry of the current iteration from iomap_zero_iter() > when necessary. Hmmm. It looks exactly like you are trying to reinvent the patchset I've been working on for the past few weeks... [...] > I have some patches around that demonstrate such things, but given the > current behavior of an unconditional flush and so far only seeing one > user report from an oddball pattern, I don't see much need for > additional hacks at the moment. > > Thoughts? IMO, the iter retry and writeback check are both wonky and > this is something that should first go into ->iomap_begin(), and only > lift into iomap after a period of time. That's just my .02. I do plan to > clean up with a retry helper and add comments and whatnot if this > survives testing and/or any functional issues are worked out. I think I've already solved all these problems with the above IOMAP_F_STALE infrastructure in the above patchset. 
If that's the case, then it seems like a no-brainer to me to base the
fix for this problem on the IOMAP_F_STALE capability we will soon have
in this code...

> index 91ee0b308e13..649d94ad3808 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +	struct inode *inode = iter->inode;
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
>  	loff_t written = 0;
> +	int status;
>  
>  	/* already zeroed? we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> -		return length;
> +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> +		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
> +			return length;
> +
> +		status = filemap_write_and_wait_range(inode->i_mapping, pos,
> +				pos + length - 1);
> +		if (status)
> +			return status;
> +		/* XXX: hacked up iter retry */
> +		iter->iomap.length = 0;
> +		return 1;

Yup, the IOMAP_F_STALE detection in iomap_write_begin() will detect
writeback races w/ unwritten extents that change the extent state
without needing to actually do physical writeback here.

IOWs, I think the fix here is not to skip IOMAP_UNWRITTEN extents up
front, but to allow zeroing to detect data in the page cache over
unwritten extents naturally.

i.e. if the iomap type is unwritten and we are zeroing, then
iomap_write_begin() drops the FGP_CREAT flag. This means the folio
lookup will only return a folio if there was already a cached folio
in the page cache. If there is a cached folio, it will do iomap
validation and detect writeback races, marking the iomap stale. If
there is no folio or the folio returned is not dirty, then
iomap_zero_iter just moves on to checking the next folio in the
range...

> @@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
>  	if (ret <= 0)
>  		return ret;
>  
> +	if ((iter->flags & IOMAP_ZERO) &&
> +	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
> +				iter->pos + iter->len - 1)) {
> +		iomap_flags |= IOMAP_F_DIRTY_CACHE;
> +	}

I really don't think we need this - I think we can do the page cache
state check and the iomap stale check directly in the
iomap_zero_iter() loop as per above...

Cheers,

Dave.
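A minimal sketch of the folio lookup change Dave describes above,
assuming a hypothetical iomap_zero_get_folio() helper and the page
cache APIs of that era; this is illustrative only, not the actual
patchset:

	/*
	 * Sketch: zeroing over an unwritten extent should only touch
	 * folios that already exist in the page cache. Dropping
	 * FGP_CREAT means a cache miss returns NULL instead of
	 * instantiating a new folio, so the zeroing iter can simply
	 * skip ahead.
	 */
	static struct folio *
	iomap_zero_get_folio(struct iomap_iter *iter, loff_t pos)
	{
		unsigned int fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT |
				   FGP_STABLE | FGP_NOFS;

		if (iomap_iter_srcmap(iter)->type == IOMAP_UNWRITTEN)
			fgp &= ~FGP_CREAT;

		return __filemap_get_folio(iter->inode->i_mapping,
				pos >> PAGE_SHIFT, fgp,
				mapping_gfp_mask(iter->inode->i_mapping));
	}

A clean folio returned here can still be skipped; a dirty one goes
through the normal zeroing write, and the stale-iomap check in
iomap_write_begin() is what would catch racing writeback completion.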
On Fri, Nov 04, 2022 at 09:25:41AM +1100, Dave Chinner wrote:
> On Thu, Nov 03, 2022 at 10:53:16AM -0400, Brian Foster wrote:
> > On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> > > > On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > >
> > > > > Here's a quick prototype of "option 3" described in my previous mail.
> > > > > This has been spot tested and confirmed to prevent the original stale
> > > > > data exposure problem. More thorough regression testing is still
> > > > > required. Barring unforeseen issues with that, however, I think this is
> > > > > tentatively my new preferred option. The primary reason for that is it
> > > > > avoids looking at extent state and is more in line with what iomap based
> > > > > zeroing should be doing more generically.
> > > > >
> > > > > Because of that, I think this provides a bit more opportunity for follow
> > > > > on fixes (there are other truncate/zeroing problems I've come across
> > > > > during this investigation that still need fixing), cleanup and
> > > > > consolidation of the zeroing code. For example, I think the trajectory
> > > > > of this could look something like:
> > > > >
> > > > > - Genericize a bit more to handle all truncates.
> > > > > - Repurpose iomap_truncate_page() (currently only used by XFS) into a
> > > > >   unique implementation from zero range that does explicit zeroing
> > > > >   instead of relying on pagecache truncate.
> > > > > - Refactor XFS ranged zeroing to an abstraction that uses a combination
> > > > >   of iomap_zero_range() and the new iomap_truncate_page().
> > > > >
> > > >
> > > > After playing with this and thinking a bit more about the above, I think
> > > > I managed to come up with an iomap_truncate_page() prototype that DTRT
> > > > based on this. Only spot tested so far, needs to pass iomap_flags to the
> > > > other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
> > > > other bugs/warts, etc. etc. This is just a quick prototype to
> > > > demonstrate the idea, which is essentially to check dirty state along
> > > > with extent state while under lock and transfer that state back to iomap
> > > > so it can decide whether it can shortcut or forcibly perform the zero.
> > > >
> > > > In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
> > > > under lock and implies that the range is sub-block (single page).
> > > > IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
> > > > dirty, so perform the zero via buffered write regardless of extent
> > > > state.
> > >
> > > I'd much prefer we fix this in the iomap infrastructure - failing to
> > > zero dirty data in memory over an unwritten extent isn't an XFS bug,
> > > so we shouldn't be working around it in XFS like we did previously.
> > >
> >
> > I agree, but that was the original goal from the start. It's easier said
> > than done to just have iomap accurately write/skip the appropriate
> > ranges..
> >
> > > I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> > > because that indicates the caller context, not what we are asking
> > > the internal iomap code to do. What we are really asking
> > > iomap_zero_iter() to do is zero the page cache if it exists in
> > > memory, otherwise ignore unwritten/hole pages.
> > > Hence I think a name
> > > like IOMAP_ZERO_PAGECACHE is more appropriate,
> > >
> >
> > That was kind of the point for this prototype. The flag isn't just
> > asking iomap to perform some generic write behavior. It also indicates
> > this is a special snowflake mode with assumptions from the caller (i.e.,
> > unflushed, sub-block/partial range) that facilitate forced zeroing
> > internally.
> >
> > I've since come to the conclusion that this approach is just premature.
> > It really only does the right thing in this very particular case,
> > otherwise there is potential for odd/unexpected behavior in sub-page
> > blocksize scenarios. It could be made to work more appropriately
> > eventually, but more thought and work is required and there's a jump in
> > complexity that isn't required to fix the immediate performance problem
> > and additional stale data exposure problems.
> >
> > > >
> > > > Brian
> > > >
> > > > --- 8< ---
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 91ee0b308e13..14a9734b2838 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > > >  	loff_t written = 0;
> > > >  
> > > >  	/* already zeroed? we're done. */
> > > > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > > +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> > > > +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
> > > >  		return length;
> > >
> > > Why even involve the filesystem in this? We can do this directly
> > > in iomap_zero_iter() with:
> > >
> > > 	if (srcmap->type == IOMAP_HOLE)
> > > 		return;
> > > 	if (srcmap->type == IOMAP_UNWRITTEN) {
> > > 		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
> > > 			return;
> > > 		if (!filemap_range_needs_writeback(inode->i_mapping,
> > > 				iomap->offset, iomap->offset + iomap->length))
> > > 			return;
> > > 	}
> > >
> >
> > This reintroduces the same stale data exposure race fixed by the
> > original patch. folio writeback can complete and convert an extent
> > reported as unwritten such that zeroing will not occur when it should.
>
> That sounds like the exact data corruption bug that is being fixed in
> this xfs/iomap series here:
>
> https://lore.kernel.org/linux-xfs/20221101003412.3842572-1-david@fromorbit.com/
>
> > The writeback check either needs to be in the fs code where it can
> > prevent writeback completion from converting extents (under ilock), or
> > earlier in iomap such that we're guaranteed to see either writeback
> > state or the converted extent.
>
> Or we catch stale iomaps once we've locked the page we are going to
> modify and remap the extent before continuing the modification
> operation.
>

I wasn't planning to modify pages for reasons mentioned earlier. It's a
non-trivial jump in complexity for the problem I'm trying to address.

> > Given the above, I'm currently putting the prototype shown below through
> > some regression testing. It lifts the writeback check into iomap and
> > issues a flush and retry of the current iteration from iomap_zero_iter()
> > when necessary.
>
> Hmmm. It looks exactly like you are trying to reinvent the patchset
> I've been working on for the past few weeks...
>

Possibly.. I've not looked at it. First and foremost, what I'm really
trying to do here is find a better way to deal with the truncate flush
problem, one that moves things closer to the long term goal of being
handled by iomap.

> [...]
> > I have some patches around that demonstrate such things, but given the
> > current behavior of an unconditional flush and so far only seeing one
> > user report from an oddball pattern, I don't see much need for
> > additional hacks at the moment.
> >
> > Thoughts? IMO, the iter retry and writeback check are both wonky and
> > this is something that should first go into ->iomap_begin(), and only
> > lift into iomap after a period of time. That's just my .02. I do plan to
> > clean up with a retry helper and add comments and whatnot if this
> > survives testing and/or any functional issues are worked out.
>
> I think I've already solved all these problems with the above
> IOMAP_F_STALE infrastructure in the above patchset. If that's the
> case, then it seems like a no-brainer to me to base the fix for this
> problem on the IOMAP_F_STALE capability we will soon have in this
> code...
>

I'd have to take a closer look. If the IOMAP_F_STALE thing is
essentially a superset of what I was trying to do wrt the retry thing
(which is what it sounds like), then I agree it makes sense to reuse
that.

> > index 91ee0b308e13..649d94ad3808 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
> >  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  {
> >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > +	struct inode *inode = iter->inode;
> >  	loff_t pos = iter->pos;
> >  	loff_t length = iomap_length(iter);
> >  	loff_t written = 0;
> > +	int status;
> >  
> >  	/* already zeroed? we're done. */
> > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > -		return length;
> > +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > +		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
> > +			return length;
> > +
> > +		status = filemap_write_and_wait_range(inode->i_mapping, pos,
> > +				pos + length - 1);
> > +		if (status)
> > +			return status;
> > +		/* XXX: hacked up iter retry */
> > +		iter->iomap.length = 0;
> > +		return 1;
>
> Yup, the IOMAP_F_STALE detection in iomap_write_begin() will detect
> writeback races w/ unwritten extents that change the extent state
> without needing to actually do physical writeback here.
>
> IOWs, I think the fix here is not to skip IOMAP_UNWRITTEN extents up
> front, but to allow zeroing to detect data in the page cache over
> unwritten extents naturally.
>
> i.e. if the iomap type is unwritten and we are zeroing, then
> iomap_write_begin() drops the FGP_CREAT flag. This means the folio
> lookup will only return a folio if there was already a cached folio
> in the page cache. If there is a cached folio, it will do iomap
> validation and detect writeback races, marking the iomap stale. If
> there is no folio or the folio returned is not dirty, then
> iomap_zero_iter just moves on to checking the next folio in the
> range...
>

The problem I have is that I've got three or four additional prototypes
trying to do such things that haven't even been worth posting because
they all produce tricky-to-diagnose issues, including one that does
exactly what you note above regarding dropping FGP_CREAT for mappings
trimmed down based on sub-extent dirty state. That one probably has its
own bugs as well; I suspect aspects of iomap zeroing and of extent
reporting in XFS (related to zeroing) need to change to make this all
function correctly, but I don't have it characterized well enough to say
for sure.
> > @@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
> >  	if (ret <= 0)
> >  		return ret;
> >  
> > +	if ((iter->flags & IOMAP_ZERO) &&
> > +	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
> > +				iter->pos + iter->len - 1)) {
> > +		iomap_flags |= IOMAP_F_DIRTY_CACHE;
> > +	}
>
> I really don't think we need this - I think we can do the page cache
> state check and the iomap stale check directly in the
> iomap_zero_iter() loop as per above...
>

Ok, this may be a reasonable end goal, but there's a clear jump in
complexity, dependency requirements, and test scope between the targeted
fix I proposed and ultimate iomap zeroing perfection. I'm not trying to
achieve the latter. I'm trying to address a performance escalation in a
way that moves things more in that direction to the degree possible.

I agree that if IOMAP_F_STALE overlaps with the approach in this patch,
it makes no sense to try and duplicate that here or create code
conflicts or dependencies. I have another version of this patch that
implements the same dirty check, flush, retry logic within
xfs_buffered_write_iomap_begin() such that it never returns a "dirty"
hole/unwritten mapping. It's a bit more code, but effective from a
functional and performance standpoint and logically simple enough to
backport as far back as needed with minimal risk and dependencies. IMO,
that's a fairly reasonable intermediate step to get from the currently
broken state to a more ideal iomap-based implementation. I'll post it
shortly and run some more testing against it.

Brian
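A rough sketch of the kind of ->iomap_begin() side check Brian describes
here; the helper name and exact predicate are assumptions rather than
the posted patch:

	/*
	 * Hypothetical helper: with the ILOCK held, decide whether a
	 * zeroing request sits over a hole/unwritten mapping that still
	 * has dirty or writeback pagecache above it. If so, the caller
	 * drops the lock, flushes the range, and retries the extent
	 * lookup so it never hands iomap a "dirty" hole/unwritten
	 * mapping.
	 */
	static bool
	xfs_zero_range_needs_flush(
		struct xfs_inode	*ip,
		struct xfs_bmbt_irec	*imap,
		loff_t			offset,
		loff_t			count,
		unsigned int		flags)
	{
		if (!(flags & IOMAP_ZERO))
			return false;
		if (imap->br_startblock != HOLESTARTBLOCK &&
		    imap->br_state != XFS_EXT_UNWRITTEN)
			return false;
		return filemap_range_needs_writeback(VFS_I(ip)->i_mapping,
				offset, offset + count - 1);
	}

Because the check happens under the ILOCK, writeback completion cannot
convert the extent between sampling pagecache state and returning the
mapping, which is exactly the race the iomap-level variant is exposed
to.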
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..1679fafaec6f 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -784,11 +784,13 @@ xfs_setattr_size(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	struct folio		*eof_folio = NULL;
 	xfs_off_t		oldsize, newsize;
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
 	bool			did_zeroing = false;
+	bool			eof_dirty = false;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -841,20 +843,40 @@ xfs_setattr_size(
 			&did_zeroing);
 	} else {
 		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
+		 * iomap won't detect a dirty folio over an unwritten block (or
+		 * a cow block over a hole) and subsequently skips zeroing the
+		 * newly post-EOF portion of the folio. Doing a flush here (i.e.
+		 * as is done for fallocate ZERO_RANGE) updates extent state for
+		 * iomap, but has too much overhead for the truncate path.
+		 *
+		 * Instead, check whether the new EOF is dirty in pagecache. If
+		 * so, hold a reference across the pagecache truncate and dirty
+		 * the folio. This ensures that partial folio zeroing from the
+		 * truncate makes it to disk in the rare event that iomap skips
+		 * zeroing and writeback happens to complete before the
+		 * pagecache truncate. Note that this really should be handled
+		 * properly by iomap zero range.
 		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
+		eof_folio = filemap_lock_folio(inode->i_mapping,
+				(newsize - 1) >> PAGE_SHIFT);
+		if (eof_folio) {
+			if (folio_test_dirty(eof_folio) ||
+			    folio_test_writeback(eof_folio))
+				eof_dirty = true;
+			folio_unlock(eof_folio);
+			if (!eof_dirty) {
+				folio_put(eof_folio);
+				eof_folio = NULL;
+			}
+		}
 
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-	if (error)
+	if (error) {
+		if (eof_folio)
+			folio_put(eof_folio);
 		return error;
+	}
 
 	/*
 	 * We've already locked out new page faults, so now we can safely remove
@@ -878,6 +900,15 @@ xfs_setattr_size(
 	 * guaranteed not to write stale data past the new EOF on truncate down.
 	 */
 	truncate_setsize(inode, newsize);
+	if (eof_folio) {
+		trace_printk("%d: ino 0x%llx newsize 0x%llx folio idx 0x%lx did_zeroing %d\n",
+			__LINE__, ip->i_ino, newsize, folio_index(eof_folio), did_zeroing);
+		if (!did_zeroing) {
+			filemap_dirty_folio(inode->i_mapping, eof_folio);
+			did_zeroing = true;
+		}
+		folio_put(eof_folio);
+	}
 
 	/*
 	 * We are going to log the inode size change in this transaction so
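For context on why re-dirtying the folio is sufficient here: pagecache
truncate already zeroes the in-memory tail of a folio straddling the new
EOF, it just does not mark the folio dirty afterwards. A heavily
abridged sketch of that behavior (the real logic lives in
truncate_inode_partial_folio() in mm/truncate.c; the helper name below
is illustrative):

	/*
	 * Abridged sketch: the post-EOF tail of a partial folio is
	 * zeroed in memory during truncate, but the folio is not
	 * re-dirtied, so the zeroes only reach disk if something else
	 * writes the folio back. Holding a reference and calling
	 * filemap_dirty_folio() afterwards, as the patch above does,
	 * guarantees that writeback happens.
	 */
	static void truncate_zero_eof_tail(struct folio *folio, loff_t newsize)
	{
		size_t offset = offset_in_folio(folio, newsize);

		folio_zero_range(folio, offset, folio_size(folio) - offset);
	}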
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Here's a quick prototype of "option 3" described in my previous mail.
This has been spot tested and confirmed to prevent the original stale
data exposure problem. More thorough regression testing is still
required. Barring unforeseen issues with that, however, I think this is
tentatively my new preferred option. The primary reason for that is it
avoids looking at extent state and is more in line with what iomap based
zeroing should be doing more generically.

Because of that, I think this provides a bit more opportunity for follow
on fixes (there are other truncate/zeroing problems I've come across
during this investigation that still need fixing), cleanup and
consolidation of the zeroing code. For example, I think the trajectory
of this could look something like:

- Genericize a bit more to handle all truncates.
- Repurpose iomap_truncate_page() (currently only used by XFS) into a
  unique implementation from zero range that does explicit zeroing
  instead of relying on pagecache truncate.
- Refactor XFS ranged zeroing to an abstraction that uses a combination
  of iomap_zero_range() and the new iomap_truncate_page().

From there we'd hopefully have predictable and functionally correct
zeroing in the filesystem. The next step would probably be to see if/how
the truncate page and zero range implementations could combine into a
single zero range implementation. I have vague thoughts on that, but at
this stage I'm not going too deep into how that should look without some
sort of functional implementation to base it on.

Brian

 fs/xfs/xfs_iops.c | 49 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)
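For reference, the XFS entry point that the refactoring trajectory above
would build on simply delegates to iomap today; roughly as follows,
sketched from the tree of that era (details such as the DAX branch may
differ):

	int
	xfs_truncate_page(
		struct xfs_inode	*ip,
		xfs_off_t		pos,
		bool			*did_zero)
	{
		struct inode		*inode = VFS_I(ip);

		/* sub-block zeroing of the EOF block is delegated to iomap */
		if (IS_DAX(inode))
			return dax_truncate_page(inode, pos, did_zero,
					&xfs_dax_write_iomap_ops);
		return iomap_truncate_page(inode, pos, did_zero,
					&xfs_buffered_write_iomap_ops);
	}

A combined zero range implementation would presumably replace both this
and xfs_zero_range() with a single call that handles the sub-block EOF
case internally.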