Message ID: 20220826214703.134870-5-jlayton@kernel.org (mailing list archive)
State: New, archived
Series: vfs: clean up i_version behavior and expose it via statx
On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> xfs will update the i_version when updating only the atime value, which
> is not desirable for any of the current consumers of i_version. Doing so
> leads to unnecessary cache invalidations on NFS and extra measurement
> activity in IMA.
>
> Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> transaction should not update the i_version. Set that value in
> xfs_vn_update_time if we're only updating the atime.
>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: David Wysochanski <dwysocha@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
>  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
>  fs/xfs/xfs_iops.c               | 11 +++++++++--
>  3 files changed, 11 insertions(+), 4 deletions(-)
>
> Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> the problem. I still think this approach should at least fix the worst
> problems with atime updates being counted. We can look to carve out
> other "spurious" i_version updates as we identify them.

AFAIK, "spurious" covers only inode block map changes due to writeback of
dirty pages. Does anybody know of other cases?

Regarding inode block map changes: first of all, I don't think there is
any practical loss from invalidating the NFS client cache on dirty data
writeback, because an NFS server should be serving cold data most of the
time. If there are a few unneeded cache invalidations, they would only be
temporary.

One may even consider whether an NFSv4 server shouldn't flush dirty data
of an inode before granting a read lease to a client. After all, if a read
lease was granted, the client cached data, and then the server crashed
before persisting the dirty data, the client will have cached a "future"
version of the data; if i_version on the server did not roll back in that
situation, we are looking at possible data corruption.

The same goes for IMA. IIUC, the IMA data checksum would be stored in an
xattr? Storing in an xattr a data checksum for data that is not persistent
on disk would be an odd choice.

So in my view, I only see benefits to the current i_version users in the
xfs i_version implementation, and I don't think it contradicts the
i_version definition in the man page patch.

> If however there are offline analysis tools that require atime updates
> to be counted, then we won't be able to do this. If that's the case, how
> can we fix this such that serving xfs via NFSv4 doesn't suck?

If I read the arguments correctly, implicit atime updates could be relaxed
as long as this behavior is clearly documented and coherent across all
implementations.

Forensics and other applications that care about atime updates can and
should check atime; they don't need i_version to know that it changed.
The reliability of atime as an audit tool has dropped considerably since
relatime became the default.

If we want to be paranoid, maybe we can keep the i_version increment on
atime updates when the user has opted in to strict '-o atime' updates,
but IMO there is no need for that.

Thanks,
Amir.
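For context, the i_version counter under discussion is maintained through
the helpers in include/linux/iversion.h. A minimal sketch of both sides of
the contract (the surrounding function names are illustrative only, not
taken from any real filesystem):

    #include <linux/iversion.h>

    /* Producer side: a filesystem records a change. The "maybe" variant
     * only bumps the counter if someone has queried it since the last
     * bump, avoiding metadata updates that nobody would observe. */
    static void example_record_change(struct inode *inode)
    {
            inode_maybe_inc_iversion(inode, false);
    }

    /* Consumer side: NFSD/IMA-style readers query the counter, which
     * also marks it so that the next change is guaranteed to bump it. */
    static u64 example_query_change_attr(struct inode *inode)
    {
            return inode_query_iversion(inode);
    }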
On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> [...]
> Regarding inode block map changes: first of all, I don't think there is
> any practical loss from invalidating the NFS client cache on dirty data
> writeback, because an NFS server should be serving cold data most of the
> time. If there are a few unneeded cache invalidations, they would only be
> temporary.

Unless there is an issue with a writer NFS client that invalidates its own
attribute caches on server data writeback?

[...]
On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
> [...]
> > Regarding inode block map changes: first of all, I don't think there is
> > any practical loss from invalidating the NFS client cache on dirty data
> > writeback, because an NFS server should be serving cold data most of
> > the time. If there are a few unneeded cache invalidations, they would
> > only be temporary.
>
> Unless there is an issue with a writer NFS client that invalidates its
> own attribute caches on server data writeback?

The client just looks at the file attributes (of which i_version is but
one), and if certain attributes have changed (mtime, ctime, i_version,
etc...) then it invalidates its cache.

In the case of block map changes, could that mean a difference in the
observable sparse regions of the file? If so, then a READ_PLUS before the
change and a READ_PLUS after could give different results. Since that
difference is observable by the client, I'd think we'd want to bump
i_version for that anyway.

> One may even consider whether an NFSv4 server shouldn't flush dirty data
> of an inode before granting a read lease to a client. After all, if a
> read lease was granted, the client cached data, and then the server
> crashed before persisting the dirty data, the client will have cached a
> "future" version of the data; if i_version on the server did not roll
> back in that situation, we are looking at possible data corruption.

We don't hand out read leases if there are file descriptions open for
write. NFS clients usually issue a COMMIT before closing a stateid in
order to satisfy close-to-open cache coherency, so in most cases this is
probably not an issue. It might still be worthwhile to make sure of it by
doing a filemap_write_and_wait before we hand out a delegation, but that's
likely to be a no-op in most cases anyway.

Note too that the client will still revalidate its caches when it receives
attributes even when it holds a read delegation. In fact, this behavior
mostly papered over a rather nasty knfsd bug we found recently where it
was allowing conflicting activity to proceed even when there was a read
delegation outstanding.

[...]
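A minimal sketch of the filemap_write_and_wait idea mentioned above (the
function name nfsd4_grant_read_deleg and its placement are hypothetical;
the real nfsd delegation path is more involved):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Hypothetical hook: flush dirty pagecache before handing out a
     * read delegation, so that later writeback cannot bump i_version
     * behind the delegation holder's back. */
    static int nfsd4_grant_read_deleg(struct inode *inode)
    {
            int err;

            /* Usually a no-op: clients tend to COMMIT before close. */
            err = filemap_write_and_wait(inode->i_mapping);
            if (err)
                    return err;

            /* ... proceed with setting up the delegation ... */
            return 0;
    }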
On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> [...]
> The client just looks at the file attributes (of which i_version is but
> one), and if certain attributes have changed (mtime, ctime, i_version,
> etc...) then it invalidates its cache.
>
> In the case of block map changes, could that mean a difference in the
> observable sparse regions of the file? If so, then a READ_PLUS before the
> change and a READ_PLUS after could give different results. Since that
> difference is observable by the client, I'd think we'd want to bump
> i_version for that anyway.

How /is/ READ_PLUS supposed to detect sparse regions, anyway? I know
that's been the subject of recent debate. At least as far as XFS is
concerned, a file range can go from hole -> delayed allocation
reservation -> unwritten extent -> (actual writeback) -> written extent.
The dance became rather more complex when we added COW. If any of that
will make a difference for READ_PLUS, then yes, I think you'd want file
writeback activities to bump iversion to cause client invalidations,
like (I think) Dave said.

The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
written and delalloc extents; an unwritten extent will report data for
any pagecache it finds.

> > > Forensics and other applications that care about atime updates can
> > > and should check atime; they don't need i_version to know that it
> > > changed. The reliability of atime as an audit tool has dropped
> > > considerably since relatime became the default.

I've been waiting for Amir to appear in this discussion -- ISTR that a
few years ago you were wanting the ability to scan a filesystem to look
for files that have changed since a given point. If XFS exported its
di_changecount file attribute (as it currently behaves) via BULKSTAT,
you'd have the ability to do that, so long as your application could
persist bulkstat data and compare.

--D
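For anyone wanting to observe those state transitions from userspace, a
small sketch that walks a file's data/hole segments with SEEK_DATA and
SEEK_HOLE (the same iomap machinery mentioned above); running it on XFS
before and after writeback can show delalloc ranges being reported as
data:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd = open(argv[1], O_RDONLY);
            off_t data = 0, hole;

            if (fd < 0)
                    return 1;
            /* SEEK_DATA fails with ENXIO once past the last data. */
            while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
                    hole = lseek(fd, data, SEEK_HOLE);
                    if (hole < 0)
                            break;
                    printf("data: [%lld, %lld)\n",
                           (long long)data, (long long)hole);
                    data = hole;
            }
            close(fd);
            return 0;
    }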
On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote: > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote: > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote: > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein > > > <amir73il@gmail.com> wrote: > > > > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton > > > > <jlayton@kernel.org> wrote: > > > > > > > > > > xfs will update the i_version when updating only the atime > > > > > value, which > > > > > is not desirable for any of the current consumers of > > > > > i_version. Doing so > > > > > leads to unnecessary cache invalidations on NFS and extra > > > > > measurement > > > > > activity in IMA. > > > > > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that > > > > > the > > > > > transaction should not update the i_version. Set that value > > > > > in > > > > > xfs_vn_update_time if we're only updating the atime. > > > > > > > > > > Cc: Dave Chinner <david@fromorbit.com> > > > > > Cc: NeilBrown <neilb@suse.de> > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com> > > > > > Cc: David Wysochanski <dwysocha@redhat.com> > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > --- > > > > > fs/xfs/libxfs/xfs_log_format.h | 2 +- > > > > > fs/xfs/libxfs/xfs_trans_inode.c | 2 +- > > > > > fs/xfs/xfs_iops.c | 11 +++++++++-- > > > > > 3 files changed, 11 insertions(+), 4 deletions(-) > > > > > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to > > > > > illustrate > > > > > the problem. I still think this approach should at least fix > > > > > the worst > > > > > problems with atime updates being counted. We can look to > > > > > carve out > > > > > other "spurious" i_version updates as we identify them. > > > > > > > > > > > > > AFAIK, "spurious" is only inode blocks map changes due to > > > > writeback > > > > of dirty pages. Anybody know about other cases? > > > > > > > > Regarding inode blocks map changes, first of all, I don't think > > > > that there is > > > > any practical loss from invalidating NFS client cache on dirty > > > > data writeback, > > > > because NFS server should be serving cold data most of the > > > > time. > > > > If there are a few unneeded cache invalidations they would only > > > > be temporary. > > > > > > > > > > Unless there is an issue with a writer NFS client that > > > invalidates its > > > own attribute > > > caches on server data writeback? > > > > > > > The client just looks at the file attributes (of which i_version is > > but > > one), and if certain attributes have changed (mtime, ctime, > > i_version, > > etc...) then it invalidates its cache. > > > > In the case of blocks map changes, could that mean a difference in > > the > > observable sparse regions of the file? If so, then a READ_PLUS > > before > > the change and a READ_PLUS after could give different results. > > Since > > that difference is observable by the client, I'd think we'd want to > > bump > > i_version for that anyway. > > How /is/ READ_PLUS supposed to detect sparse regions, anyway? I know > that's been the subject of recent debate. At least as far as XFS is > concerned, a file range can go from hole -> delayed allocation > reservation -> unwritten extent -> (actual writeback) -> written > extent. > The dance became rather more complex when we added COW. If any of > that > will make a difference for READ_PLUS, then yes, I think you'd want > file > writeback activities to bump iversion to cause client invalidations, > like (I think) Dave said. 
> > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for > written and delalloc extents; and an unwritten extent will report > data > for any pagecache it finds. > READ_PLUS should never return anything different than a read() system call would return for any given area. The way it reports sparse regions vs. data regions is purely an RPC formatting convenience. The only point to note about NFS READ and READ_PLUS is that because the client is forced to send multiple RPC calls if the user is trying to read a region that is larger than the 'rsize' value, it is possible that these READ/READ_PLUS calls may be processed out of order, and so the result may end up looking different than if you had executed a read() call for the full region directly on the server. However each individual READ / READ_PLUS reply should look as if the user had called read() on that rsize-sized section of the file. > > >
On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> [...]
> READ_PLUS should never return anything different than a read() system
> call would return for any given area. The way it reports sparse regions
> vs. data regions is purely an RPC formatting convenience.
>
> [...] However, each individual READ / READ_PLUS reply should look as if
> the user had called read() on that rsize-sized section of the file.

Yeah, thinking about it some more, simply changing the block allocation is
not something that should affect the ctime, so we probably don't want to
bump i_version on it. It's an implicit change, IOW, not an explicit one.

The fact that xfs might do that is unfortunate, but it's not the end of
the world, and it still conforms to the proposed definition for i_version.
In practice, this sort of allocation change should come soon after the
file was written, so one would hope that any damage due to the false
i_version bump would be minimized.

It would be nice to teach it not to do that, however. Maybe we can insert
the NOIVER flag at a strategic place to avoid it?
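A sketch of where such a flag would sit, reconstructed from the patch
changelog above rather than from the actual diff (the XFS_ILOG_NOIVER bit
value and the function body here are assumptions):

    #define XFS_ILOG_NOIVER (1u << 15)      /* assumed bit value */

    /* Sketch of the changelog's idea: atime-only timestamp updates are
     * logged with a flag telling xfs_trans_log_inode() to skip the
     * i_version bump. */
    static int xfs_vn_update_time(struct inode *inode,
                                  struct timespec64 *now, int flags)
    {
            int log_flags = XFS_ILOG_TIMESTAMP;

            /* Nothing but atime is changing: don't count it. */
            if (!(flags & ~S_ATIME))
                    log_flags |= XFS_ILOG_NOIVER;

            /* ... allocate a transaction, update the timestamps, and
             * log the inode with log_flags ... */
            return 0;
    }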
On Sat, 2022-08-27 at 12:10 -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> [...]
> Yeah, thinking about it some more, simply changing the block allocation
> is not something that should affect the ctime, so we probably don't want
> to bump i_version on it. It's an implicit change, IOW, not an explicit
> one.

As you say, it is unfortunate that XFS does this, and it is unfortunate
that it then changes the blocks-allocated attribute post-fsync(), since
all that does cause confusion for certain applications. However, I agree
100% that this is an implicit change that is driven by the filesystem and
not the user application. Hence it is not an action that needs to be
recorded with a change attribute bump.
On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> [...]
> Yeah, thinking about it some more, simply changing the block allocation
> is not something that should affect the ctime, so we probably don't want
> to bump i_version on it. It's an implicit change, IOW, not an explicit
> one.
>
> The fact that xfs might do that is unfortunate, but it's not the end of
> the world, and it still conforms to the proposed definition for
> i_version. In practice, this sort of allocation change should come soon
> after the file was written, so one would hope that any damage due to the
> false i_version bump would be minimized.

That was exactly my point.

> It would be nice to teach it not to do that, however. Maybe we can
> insert the NOIVER flag at a strategic place to avoid it?

Why would that be nice to avoid? You did not specify any use case where
incrementing i_version on a block mapping change matters in practice. On
the contrary, you said that an NFS client writer sends a COMMIT on close,
which should stabilize i_version for the next readers.

Given that we already have an xfs implementation that does increment
i_version on block mapping changes, and that it would be a pain to change
that or add a new user option, I don't see the point in discussing it
further unless there is a good incentive for avoiding i_version updates
in those cases.

Thanks,
Amir.
On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> [...]
> > It would be nice to teach it not to do that, however. Maybe we can
> > insert the NOIVER flag at a strategic place to avoid it?
>
> Why would that be nice to avoid? You did not specify any use case where
> incrementing i_version on a block mapping change matters in practice. On
> the contrary, you said that an NFS client writer sends a COMMIT on
> close, which should stabilize i_version for the next readers.
>
> Given that we already have an xfs implementation that does increment
> i_version on block mapping changes, and that it would be a pain to
> change that or add a new user option, I don't see the point in
> discussing it further unless there is a good incentive for avoiding
> i_version updates in those cases.

Because the change to the block allocation doesn't represent an
"explicit" change to the inode. We will have bumped the ctime on the
original write (in update_time), but the follow-on changes that occur due
to that write needn't be counted, as they aren't visible to the client.

It's possible for a client to issue a read between the write and the
flush and get the interim value for i_version. Then, once the writeback
happens and the i_version gets bumped again, the client invalidates its
cache even though it needn't do so.

The race window ought to be relatively small, and this wouldn't result in
incorrect behavior that you'd notice (other than loss of performance),
but it's not ideal. We're doing more on-the-wire reads than are necessary
in this case. It would be nice to have it not do that.

If we end up taking this patch to make it elide the i_version bumps on
atime updates, we may be able to set the NOIVER flag in other cases as
well, and avoid some of these extra bumps.
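To make the race concrete, a userspace sketch against the statx exposure
this series proposes (STATX_INO_VERSION and stx_ino_version are the
series' proposed names, not a released ABI, so this will not build
against current kernel headers):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    static unsigned long long query_iversion(const char *path)
    {
            struct statx stx;

            /* STATX_INO_VERSION / stx_ino_version: names proposed by
             * this series, assumed here. */
            statx(AT_FDCWD, path, 0, STATX_INO_VERSION, &stx);
            return stx.stx_ino_version;
    }

    int main(void)
    {
            int fd = open("f", O_CREAT | O_WRONLY, 0644);

            write(fd, "data", 4);
            /* A client read landing here caches this interim value... */
            printf("after write: %llu\n", query_iversion("f"));
            fsync(fd);
            /* ...and on current XFS the block map update during
             * writeback can bump it again with no visible change. */
            printf("after fsync: %llu\n", query_iversion("f"));
            close(fd);
            return 0;
    }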
On Sun, Aug 28, 2022 at 5:37 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> [...]
> It's possible for a client to issue a read between the write and the
> flush and get the interim value for i_version. Then, once the writeback
> happens and the i_version gets bumped again, the client invalidates its
> cache even though it needn't do so.
>
> The race window ought to be relatively small, and this wouldn't result
> in incorrect behavior that you'd notice (other than loss of
> performance), but it's not ideal. We're doing more on-the-wire reads
> than are necessary in this case. It would be nice to have it not do
> that.

Ok. So the use case is that a client's writes on an open file can end up
invalidating the same client's read cache on the same file.

Sounds like implementing write delegations would have been better in this
case than relying on i_version, but I do not know how hard that is, or
how many clients would request write delegations.

Anyway, if you ever send a proposal to change the semantics of xfs
i_version bumping on writeback, please specify the target use case and
how likely it is for real workloads to suffer from it.

Another question to the NFS experts: what happens when the server crashes
before dirty data hits persistent storage, but the client has already
cached the dirty data? After the crash, the filesystem can come up with
the same i_version that the NFS client has cached, but without the dirty
data. Does the spec address this scenario?

Thanks,
Amir.
> > > > Forensics and other applications that care about atime updates can
> > > > and should check atime; they don't need i_version to know that it
> > > > changed. The reliability of atime as an audit tool has dropped
> > > > considerably since relatime became the default.
>
> I've been waiting for Amir to appear in this discussion -- ISTR that a
> few years ago you were wanting the ability to scan a filesystem to look
> for files that have changed since a given point. If XFS exported its
> di_changecount file attribute (as it currently behaves) via BULKSTAT,
> you'd have the ability to do that, so long as your application could
> persist bulkstat data and compare.

It's true that exporting i_version via BULKSTAT could be useful to some
backup/sync applications.

My case was a bit different: finding all the changes in a very large fs
tree, or IOW, finding out whether anything has changed inside a large
tree since a point in time. So I am not sure that a per-inode i_version
would help with that use case.

Thanks,
Amir.
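For the backup/sync case, a sketch of what Darrick's suggestion might
look like with the v5 bulkstat ioctl (the bs_changecount field is
hypothetical -- bulkstat does not export di_changecount today, and the
xfsprogs header path is assumed):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <xfs/xfs.h>    /* XFS_IOC_BULKSTAT, struct xfs_bulkstat */

    #define NR 64

    /* Walk every inode in the filesystem; a real tool would diff
     * bs_changecount (hypothetical field) against a persisted copy
     * saved by the previous scan. */
    static void scan(int fsfd)
    {
            struct xfs_bulkstat_req *req;
            uint32_t i;

            req = calloc(1, sizeof(*req) + NR * sizeof(struct xfs_bulkstat));
            req->hdr.icount = NR;

            while (!ioctl(fsfd, XFS_IOC_BULKSTAT, req) && req->hdr.ocount) {
                    for (i = 0; i < req->hdr.ocount; i++)
                            printf("ino %llu changecount %llu\n",
                                   (unsigned long long)req->bulkstat[i].bs_ino,
                                   (unsigned long long)req->bulkstat[i].bs_changecount);
                    /* hdr.ino resumes the walk automatically */
            }
            free(req);
    }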
On Sun, Aug 28, 2022 at 10:37:37AM -0400, Jeff Layton wrote: > On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote: > > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote: > > > Yeah, thinking about it some more, simply changing the block allocation > > > is not something that should affect the ctime, so we probably don't want > > > to bump i_version on it. It's an implicit change, IOW, not an explicit > > > one. > > > > > > The fact that xfs might do that is unfortunate, but it's not the end of > > > the world and it still would conform to the proposed definition for > > > i_version. In practice, this sort of allocation change should come soon > > > after the file was written, so one would hope that any damage due to the > > > false i_version bump would be minimized. > > > > > > > That was exactly my point. > > > > > It would be nice to teach it not to do that however. Maybe we can insert > > > the NOIVER flag at a strategic place to avoid it? No, absolutely not. I've already explained this: The XFS *disk format specification* says that di_changecount is bumped for every change that is made to the inode. Applications that are written from this specification expect the on disk format for a given XFS filesystem feature to remain the same until it is either deprecated and removed or we add feature flags to indicate it has different behaviour. We can't just change the behaviour at a whim. And that's ignoring the fact that randomly spewing NOIVER into transactions that modify inode metadata is a nasty hack - it is not desirable from a design or documentation POV, nor is it maintainable. > > Why would that be nice to avoid? > > You did not specify any use case where incrementing i_version > > on block mapping change matters in practice. > > On the contrary, you said that NFS client writer sends COMMIT on close, > > which should stabilize i_version for the next readers. > > > > Given that we already have an xfs implementation that does increment > > i_version on block mapping changes and it would be a pain to change > > that or add a new user options, I don't see the point in discussing it further > > unless there is a good incentive for avoiding i_version updates in those cases. > > > > Because the change to the block allocation doesn't represent an > "explicit" change to the inode. We will have bumped the ctime on the > original write (in update_time), but the follow-on changes that occur > due to that write needn't be counted as they aren't visible to the > client. > > It's possible for a client to issue a read between the write and the > flush and get the interim value for i_version. Then, once the write > happens and the i_version gets bumped again, the client invalidates its > cache even though it needn't do so. > > The race window ought to be relatively small, and this wouldn't result > in incorrect behavior that you'd notice (other than loss of > performance), but it's not ideal. We're doing more on-the-wire reads > than are necessary in this case. > > It would be nice to have it not do that. If we end up taking this patch > to make it elide the i_version bumps on atime updates, we may be able to > set the NOIVER flag in other cases as well, and avoid some of these > extra bumps. <sigh> Please don't make me repeat myself for the third time. Once we have decided on a solid, unchanging definition for the *statx user API variable*, we'll implement a new on-disk field that provides this information.
We will document it in the on-disk specification as "this is how di_iversion behaves" so that it is clear to everyone parsing the on-disk format or writing their own XFS driver how to implement it and when to expect it to change. Then we can add a filesystem and inode feature flags that say "inode has new iversion" and we use that to populate the kernel iversion instead of di_changecount. We keep di_changecount exactly the way it is now for the applications and use cases we already have for that specific behaviour. If the kernel and/or filesystem don't support the new di_iversion field, then we'll use di_changecount as it currently exists for the kernel iversion code. Keep in mind that we've been doing dynamic inode format updates in XFS for a couple of decades - users don't even have to be aware that they need to perform format upgrades because often they just happen whenever an inode is accessed. IOWs, just because we have to change the on-disk format to support this new iversion definition, it doesn't mean users have to reformat filesystems before the new feature can be used. Hence, over time, as distros update kernels, the XFS iversion behaviour will change automagically as we update inodes in existing filesystems as they are accessed to add and then use the new di_iversion field for the VFS change attribute field instead of the di_changecount field... -Dave.
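Sketched with invented names (xfs_has_iversion(), XFS_DIFLAG2_IVERSION and di_iversion do not exist; di_changecount and di_flags2 do), the selection scheme Dave describes would boil down to something like:

/*
 * Hypothetical sketch of the scheme above.  The feature helper, the
 * inode flag and the di_iversion field are invented for illustration;
 * di_changecount is the existing v3 on-disk inode field.
 */
static uint64_t
xfs_dinode_change_attr(
	struct xfs_mount	*mp,
	struct xfs_dinode	*dip)
{
	/*
	 * Inodes upgraded to the new format carry a dedicated counter
	 * with the agreed statx semantics; everything else keeps
	 * feeding the kernel i_version from di_changecount, which
	 * counts every change, atime updates included.
	 */
	if (xfs_has_iversion(mp) &&
	    (be64_to_cpu(dip->di_flags2) & XFS_DIFLAG2_IVERSION))
		return be64_to_cpu(dip->di_iversion);

	return be64_to_cpu(dip->di_changecount);
}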
On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote: > On Sun, Aug 28, 2022 at 10:37:37AM -0400, Jeff Layton wrote: > > On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote: > > > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > Yeah, thinking about it some more, simply changing the block allocation > > > > is not something that should affect the ctime, so we probably don't want > > > > to bump i_version on it. It's an implicit change, IOW, not an explicit > > > > one. > > > > > > > > The fact that xfs might do that is unfortunate, but it's not the end of > > > > the world and it still would conform to the proposed definition for > > > > i_version. In practice, this sort of allocation change should come soon > > > > after the file was written, so one would hope that any damage due to the > > > > false i_version bump would be minimized. > > > > > > > > > > That was exactly my point. > > > > > > > It would be nice to teach it not to do that however. Maybe we can insert > > > > the NOIVER flag at a strategic place to avoid it? > > No, absolutely not. > > I've already explained this: The XFS *disk format specification* > says that di_changecount is bumped for every change that is made to > the inode. > > Applications that are written from this specification expect the on > disk format for a given XFS filesystem feature to remain the same > until it is either deprecated and removed or we add feature flags to > indicate it has different behaviour. We can't just change the > behaviour at a whim. > > And that's ignoring the fact that randomly spewing NOIVER > into transactions that modify inode metadata is a nasty hack - it > is not desirable from a design or documentation POV, nor is it > maintainable. > > > > Why would that be nice to avoid? > > > You did not specify any use case where incrementing i_version > > > on block mapping change matters in practice. > > > On the contrary, you said that NFS client writer sends COMMIT on close, > > > which should stabilize i_version for the next readers. > > > > > > Given that we already have an xfs implementation that does increment > > > i_version on block mapping changes and it would be a pain to change > > > that or add a new user options, I don't see the point in discussing it further > > > unless there is a good incentive for avoiding i_version updates in those cases. > > > > > > > Because the change to the block allocation doesn't represent an > > "explicit" change to the inode. We will have bumped the ctime on the > > original write (in update_time), but the follow-on changes that occur > > due to that write needn't be counted as they aren't visible to the > > client. > > > > It's possible for a client to issue a read between the write and the > > flush and get the interim value for i_version. Then, once the write > > happens and the i_version gets bumped again, the client invalidates its > > cache even though it needn't do so. > > > > The race window ought to be relatively small, and this wouldn't result > > in incorrect behavior that you'd notice (other than loss of > > performance), but it's not ideal. We're doing more on-the-wire reads > > than are necessary in this case. > > > > It would be nice to have it not do that. If we end up taking this patch > > to make it elide the i_version bumps on atime updates, we may be able to > > set the NOIVER flag in other cases as well, and avoid some of these > > extra bumps. > > > <sigh> > > Please don't make me repeat myself for the third time.
> > Once we have decided on a solid, unchanging definition for the > *statx user API variable*, we'll implement a new on-disk field that > provides this information. We will document it in the on-disk > specification as "this is how di_iversion behaves" so that it is > clear to everyone parsing the on-disk format or writing their own > XFS driver how to implement it and when to expect it to > change. > > Then we can add a filesystem and inode feature flags that say "inode > has new iversion" and we use that to populate the kernel iversion > instead of di_changecount. We keep di_changecount exactly the way it > is now for the applications and use cases we already have for that > specific behaviour. If the kernel and/or filesystem don't support > the new di_iversion field, then we'll use di_changecount as it > currently exists for the kernel iversion code. > Aside from NFS and IMA, what applications are dependent on the current definition and how do they rely on i_version today? > Keep in mind that we've been doing dynamic inode format updates in > XFS for a couple of decades - users don't even have to be aware that > they need to perform format upgrades because often they just happen > whenever an inode is accessed. IOWs, just because we have to change > the on-disk format to support this new iversion definition, it > doesn't mean users have to reformat filesystems before the new > feature can be used. > > Hence, over time, as distros update kernels, the XFS iversion > behaviour will change automagically as we update inodes in existing > filesystems as they are accessed to add and then use the new > di_iversion field for the VFS change attribute field instead of the > di_changecount field... > If you want to create a whole new on-disk field for this, then that's your prerogative, but before you do that, I'd like to better understand why and how the constraints on this field changed. The original log message from the commit that added a change counter (below) stated that you were adding it for network filesystems like NFS. When did this change and why? commit dc037ad7d24f3711e431a45c053b5d425995e9e4 Author: Dave Chinner <dchinner@redhat.com> Date: Thu Jun 27 16:04:59 2013 +1000 xfs: implement inode change count For CRC enabled filesystems, add support for the monotonic inode version change counter that is needed by protocols like NFSv4 for determining if the inode has changed in any way at all between two unrelated operations on the inode. This bumps the change count the first time an inode is dirtied in a transaction. Since all modifications to the inode are logged, this will catch all changes that are made to the inode, including timestamp updates that occur during data writes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Ben Myers <bpm@sgi.com>
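For readers without the kernel tree at hand, the VFS helper that the commit above feeds and that the diff at the end of the thread gates with XFS_ILOG_NOIVER behaves roughly as below (condensed from include/linux/iversion.h; the mainline version adds a memory barrier and much longer comments). The "maybe" is the point: the counter only moves if someone has queried it since the last bump, or if the caller forces it.

static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
	u64 cur, new;

	cur = inode_peek_iversion_raw(inode);
	do {
		/* low bit clear: nobody sampled i_version, skip the bump */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;
		/* lowest bit is the QUERIED flag, so step by 2 and clear it */
		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
	return true;
}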
On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote: > On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote: > > > > > > The race window ought to be relatively small, and this wouldn't result > > > in incorrect behavior that you'd notice (other than loss of > > > performance), but it's not ideal. We're doing more on-the-wire reads > > > than are necessary in this case. > > > > > > It would be nice to have it not do that. If we end up taking this patch > > > to make it elide the i_version bumps on atime updates, we may be able to > > > set the NOIVER flag in other cases as well, and avoid some of these > > > extra bumps. > > > > > > <sigh> > > > > Please don't make me repeat myself for the third time. > > > > Once we have decided on a solid, unchanging definition for the > > *statx user API variable*, we'll implement a new on-disk field that > > provides this information. We will document it in the on-disk > > specification as "this is how di_iversion behaves" so that it is > > clear to everyone parsing the on-disk format or writing their own > > XFS driver how to implement it and when to expect it to > > change. > > > > Then we can add a filesystem and inode feature flags that say "inode > > has new iversion" and we use that to populate the kernel iversion > > instead of di_changecount. We keep di_changecount exactly the way it > > is now for the applications and use cases we already have for that > > specific behaviour. If the kernel and/or filesystem don't support > > the new di_iversion field, then we'll use di_changecount as it > > currently exists for the kernel iversion code. > > > > Aside from NFS and IMA, what applications are dependent on the current > definition and how do they rely on i_version today? I've answered this multiple times already: the di_changecount behaviour is defined in the on-disk specification and hence we *cannot change the behaviour* without changing the on-disk format specification. Apart from the forensics aspect of the change counter (nobody but us XFS developers seems to understand just how damn important this is), there are *many* third party applications that parse the XFS on-disk format directly. This: https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1 Shows grub2, libparted, syslinux, partclone and fsarchiver as knowing about XFS on-disk superblock flags that tell them what format the directory structure is in. That alone is enough to indicate they parse on-disk inodes directly, and hence may expect di_changecount to have specific meaning and use it to detect unexpected changes to files/directories they care about. If I go looking for XFS_SB_MAGIC, I find things like libblkid, klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the on-disk superblocks directly from the block device, too. They also rely directly on XFS developers ensuring there are no silent incompatible changes to the on disk format. I also know of many other utilities that people and companies have written that parse the on disk format directly from userspace. The functions these perform include low level storage management tools, copying and managing disk images (e.g. offline configuration for cluster deployments), data recovery tools that scrape all the data out of broken filesystems, etc. These applications are reliant on the guarantee we provide that the on-disk format will not silently change and that behaviour/structure can always easily be discovered by feature flags in the superblock and/or inodes.
IOWs, just because there aren't obvious "traditional" applications on top of the kernel filesystem that consume the in-memory kernel iversion field, it does not mean that the defined behaviour of the on-disk di_changecount field is not used or relied on by other tools that work directly on the on-disk format. You might be right that NFS doesn't care about this, but the point remains that NFS does not control the XFS on-disk format, nor does the fact that what NFS wants from the change attribute has changed over time override the fact that maintaining XFS on-disk format compatibility is the responsibility of XFS developers. We're willing to change the on-disk format to support whatever the new definition of the statx change attribute ends up being, and that should be the end of the discussion. > > Keep in mind that we've been doing dynamic inode format updates in > > XFS for a couple of decades - users don't even have to be aware that > > they need to perform format upgrades because often they just happen > > whenever an inode is accessed. IOWs, just because we have to change > > the on-disk format to support this new iversion definition, it > > doesn't mean users have to reformat filesystems before the new > > feature can be used. > > > > Hence, over time, as distros update kernels, the XFS iversion > > behaviour will change automagically as we update inodes in existing > > filesystems as they are accessed to add and then use the new > > di_iversion field for the VFS change attribute field instead of the > > di_changecount field... > > If you want to create a whole new on-disk field for this, then that's > your prerogative, but before you do that, I'd like to better understand > why and how the constraints on this field changed. > > The original log message from the commit that added a change counter > (below) stated that you were adding it for network filesystems like NFS. > When did this change and why? It never changed. I'll repeat what I've already explained twice before: https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/ https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/ tl;dr: NFS requirements were just one of *many* we had at the time for an atomic persistent change counter. The fact is that NFS users are just going to have to put up with random cache invalidations on XFS for a while longer. Nobody noticed this and/or cared about this enough to raise it as an issue for the past decade, so waiting another few months for upstream XFS to change to a different on-disk format for the NFS/statx change attribute isn't a big deal. -Dave.
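As a concrete taste of the kind of direct on-disk parsing being described, every one of those tools starts with the superblock magic in the first bytes of the filesystem (XFS_SB_MAGIC, "XFSB", stored big-endian). A minimal probe, assuming nothing beyond standard libc, looks like this:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <arpa/inet.h>	/* ntohl */

#define XFS_SB_MAGIC	0x58465342	/* "XFSB", big-endian on disk */

int main(int argc, char **argv)
{
	uint32_t magic;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	/* the magic number is the first field of the primary superblock */
	if (fd < 0 || pread(fd, &magic, sizeof(magic), 0) != sizeof(magic)) {
		perror(argv[1]);
		return 1;
	}
	if (ntohl(magic) != XFS_SB_MAGIC) {
		fprintf(stderr, "%s: not an XFS filesystem\n", argv[1]);
		return 1;
	}
	printf("%s: XFS superblock found\n", argv[1]);
	close(fd);
	return 0;
}

Only after a check like this would such a tool go on to interpret feature flags, inodes, and fields like di_changecount, which is why silent behaviour changes to those fields are treated as format breaks.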
On Tue, 2022-08-30 at 10:08 +1000, Dave Chinner wrote: > On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote: > > On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote: > > > > > > > > The race window ought to be relatively small, and this wouldn't result > > > > in incorrect behavior that you'd notice (other than loss of > > > > performance), but it's not ideal. We're doing more on-the-wire reads > > > > than are necessary in this case. > > > > > > > > It would be nice to have it not do that. If we end up taking this patch > > > > to make it elide the i_version bumps on atime updates, we may be able to > > > > set the NOIVER flag in other cases as well, and avoid some of these > > > > extra bumps. > > > > > > > > > <sigh> > > > > > > Please don't make me repeat myself for the third time. > > > > > > Once we have decided on a solid, unchanging definition for the > > > *statx user API variable*, we'll implement a new on-disk field that > > > provides this information. We will document it in the on-disk > > > specification as "this is how di_iversion behaves" so that it is > > > clear to everyone parsing the on-disk format or writing their own > > > XFS driver how to implement it and when to expect it to > > > change. > > > > > > Then we can add a filesystem and inode feature flags that say "inode > > > has new iversion" and we use that to populate the kernel iversion > > > instead of di_changecount. We keep di_changecount exactly the way it > > > is now for the applications and use cases we already have for that > > > specific behaviour. If the kernel and/or filesystem don't support > > > the new di_iversion field, then we'll use di_changecount as it > > > currently exists for the kernel iversion code. > > > > > > > Aside from NFS and IMA, what applications are dependent on the current > > definition and how do they rely on i_version today? > > I've answered this multiple times already: the di_changecount > behaviour is defined in the on-disk specification and hence we > *cannot change the behaviour* without changing the on-disk format > specification. > > Apart from the forensics aspect of the change counter (nobody > but us XFS developers seems to understand just how damn important > this is), there are *many* third party applications that parse the > XFS on-disk format directly. This: > > https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1 > > Shows grub2, libparted, syslinux, partclone and fsarchiver as > knowing about XFS on-disk superblock flags that tell them what > format the directory structure is in. That alone is enough to > indicate they parse on-disk inodes directly, and hence may expect > di_changecount to have specific meaning and use it to detect > unexpected changes to files/directories they care about. > > If I go looking for XFS_SB_MAGIC, I find things like libblkid, > klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the > on-disk superblocks directly from the block device, too. They also > rely directly on XFS developers ensuring there are no silent > incompatible changes to the on disk format. > > I also know of many other utilities that people and companies have > written that parse the on disk format directly from userspace. The > functions these perform include low level storage management tools, > copying and managing disk images (e.g. offline configuration for > cluster deployments), data recovery tools that scrape all the data > out of broken filesystems, etc.
> > These applications are reliant on the guarantee we provide that the > on-disk format will not silently change and that behaviour/structure > can always easily be discovered by feature flags in the superblock > and/or inodes. > > IOWs, just because there aren't obvious "traditional" applications on > top of the kernel filesystem that consume the in-memory kernel > iversion field, it does not mean that the defined behaviour of the > on-disk di_changecount field is not used or relied on by other tools > that work directly on the on-disk format. > > You might be right that NFS doesn't care about this, but the point > remains that NFS does not control the XFS on-disk format, nor does > the fact that what NFS wants from the change attribute has changed > over time override the fact that maintaining XFS on-disk format > compatibility is the responsibility of XFS developers. We're willing > to change the on-disk format to support whatever the new definition > of the statx change attribute ends up being, and that should be the > end of the discussion. > Thanks for spelling this out in more detail. > > > Keep in mind that we've been doing dynamic inode format updates in > > > XFS for a couple of decades - users don't even have to be aware that > > > they need to perform format upgrades because often they just happen > > > whenever an inode is accessed. IOWs, just because we have to change > > > the on-disk format to support this new iversion definition, it > > > doesn't mean users have to reformat filesystems before the new > > > feature can be used. > > > > > > Hence, over time, as distros update kernels, the XFS iversion > > > behaviour will change automagically as we update inodes in existing > > > filesystems as they are accessed to add and then use the new > > > di_iversion field for the VFS change attribute field instead of the > > > di_changecount field... > > > > > > > If you want to create a whole new on-disk field for this, then that's > > your prerogative, but before you do that, I'd like to better understand > > why and how the constraints on this field changed. > > > > The original log message from the commit that added a change counter > > (below) stated that you were adding it for network filesystems like NFS. > > When did this change and why? > > It never changed. I'll repeat what I've already explained twice > before: > > https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/ > https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/ > > tl;dr: NFS requirements were just one of *many* we had at the time > for an atomic persistent change counter. > > The fact is that NFS users are just going to have to put up with > random cache invalidations on XFS for a while longer. Nobody noticed > this and/or cared about this enough to raise it as an issue for the > past decade, so waiting another few months for upstream XFS to > change to a different on-disk format for the NFS/statx change > attribute isn't a big deal. > Fair enough. I'll plan to drop this patch from the series for now, with the expectation that you guys will add a new i_version counter that better conforms to what NFS and IMA need. Thanks,
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b351b9dc6561..866a4c5cf70c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -323,7 +323,7 @@ struct xfs_inode_log_format_32 {
 #define XFS_ILOG_ABROOT	0x100	/* log i_af.i_broot */
 #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
 #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
-
+#define XFS_ILOG_NOIVER	0x800	/* don't bump i_version */
 
 /*
  * The timestamps are dirty, but not necessarily anything else in the inode
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..ffe6d296e7f9 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -126,7 +126,7 @@ xfs_trans_log_inode(
 	 * unconditionally.
 	 */
 	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
-		if (IS_I_VERSION(inode) &&
+		if (!(flags & XFS_ILOG_NOIVER) && IS_I_VERSION(inode) &&
 		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
 			iversion_flags = XFS_ILOG_CORE;
 	}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 45518b8c613c..94f14d96641b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1041,10 +1041,17 @@ xfs_vn_update_time(
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME)
+
+	if (!(flags & S_VERSION))
+		log_flags |= XFS_ILOG_NOIVER;
+	if (flags & S_CTIME) {
 		inode->i_ctime = *now;
-	if (flags & S_MTIME)
+		log_flags &= ~XFS_ILOG_NOIVER;
+	}
+	if (flags & S_MTIME) {
 		inode->i_mtime = *now;
+		log_flags &= ~XFS_ILOG_NOIVER;
+	}
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
xfs will update the i_version when updating only the atime value, which is not desirable for any of the current consumers of i_version. Doing so leads to unnecessary cache invalidations on NFS and extra measurement activity in IMA. Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the transaction should not update the i_version. Set that value in xfs_vn_update_time if we're only updating the atime. Cc: Dave Chinner <david@fromorbit.com> Cc: NeilBrown <neilb@suse.de> Cc: Trond Myklebust <trondmy@hammerspace.com> Cc: David Wysochanski <dwysocha@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/xfs/libxfs/xfs_log_format.h | 2 +- fs/xfs/libxfs/xfs_trans_inode.c | 2 +- fs/xfs/xfs_iops.c | 11 +++++++++-- 3 files changed, 11 insertions(+), 4 deletions(-) Dave has NACK'ed this patch, but I'm sending it as a way to illustrate the problem. I still think this approach should at least fix the worst problems with atime updates being counted. We can look to carve out other "spurious" i_version updates as we identify them. If however there are offline analysis tools that require atime updates to be counted, then we won't be able to do this. If that's the case, how can we fix this such that serving xfs via NFSv4 doesn't suck?