
[v3,4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time

Message ID 20220826214703.134870-5-jlayton@kernel.org
State New, archived
Series vfs: clean up i_version behavior and expose it via statx

Commit Message

Jeff Layton Aug. 26, 2022, 9:47 p.m. UTC
xfs will update the i_version when updating only the atime value, which
is not desirable for any of the current consumers of i_version. Doing so
leads to unnecessary cache invalidations on NFS and extra measurement
activity in IMA.

Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
transaction should not update the i_version. Set that value in
xfs_vn_update_time if we're only updating the atime.

Cc: Dave Chinner <david@fromorbit.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/xfs/libxfs/xfs_log_format.h  |  2 +-
 fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
 fs/xfs/xfs_iops.c               | 11 +++++++++--
 3 files changed, 11 insertions(+), 4 deletions(-)

Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
the problem. I still think this approach should at least fix the worst
problems with atime updates being counted. We can look to carve out
other "spurious" i_version updates as we identify them.

If however there are offline analysis tools that require atime updates
to be counted, then we won't be able to do this. If that's the case, how
can we fix this such that serving xfs via NFSv4 doesn't suck?
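
For reference, here is a rough sketch of the shape the change described
above could take. The flag's bit value, the surrounding helper names, and
the exact call sites are assumptions, not the actual diff (which patchwork
renders separately):

    /* fs/xfs/libxfs/xfs_log_format.h: new inode log flag (bit value assumed) */
    #define XFS_ILOG_NOIVER        (1 << 15)       /* don't bump i_version */

    /* fs/xfs/libxfs/xfs_trans_inode.c: honor the flag when logging the inode */
    if (!(flags & XFS_ILOG_NOIVER) && IS_I_VERSION(inode) &&
        inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
            iversion_flags = XFS_ILOG_CORE;

    /* fs/xfs/xfs_iops.c, xfs_vn_update_time(): pure atime updates opt out */
    int log_flags = XFS_ILOG_CORE;

    if (!(flags & (S_CTIME | S_MTIME | S_VERSION)))
            log_flags |= XFS_ILOG_NOIVER;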

Comments

Amir Goldstein Aug. 27, 2022, 7:26 a.m. UTC | #1
On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> xfs will update the i_version when updating only the atime value, which
> is not desirable for any of the current consumers of i_version. Doing so
> leads to unnecessary cache invalidations on NFS and extra measurement
> activity in IMA.
>
> Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> transaction should not update the i_version. Set that value in
> xfs_vn_update_time if we're only updating the atime.
>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: David Wysochanski <dwysocha@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
>  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
>  fs/xfs/xfs_iops.c               | 11 +++++++++--
>  3 files changed, 11 insertions(+), 4 deletions(-)
>
> Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> the problem. I still think this approach should at least fix the worst
> problems with atime updates being counted. We can look to carve out
> other "spurious" i_version updates as we identify them.
>

AFAIK, "spurious" is only inode blocks map changes due to writeback
of dirty pages. Anybody know about other cases?

Regarding inode block map changes, first of all, I don't think that there is
any practical loss from invalidating the NFS client cache on dirty data
writeback, because the NFS server should be serving cold data most of the
time. If there are a few unneeded cache invalidations, they would only be
temporary.

One may even consider whether the NFSv4 server should flush an inode's dirty
data before granting a read lease to a client.
After all, if a read lease was granted, the client cached data, and the
server then crashed before persisting the dirty data, the client will have
cached a "future" version of the data; if i_version on the server does not
roll back in that situation, we are looking at possible data corruption.

Same goes for IMA. IIUC, the IMA data checksum would be stored in an xattr?
Storing a data checksum in an xattr for data that is not yet persistent on
disk would be an odd choice.

So in my view, I only see benefits to current i_version users in the xfs
i_version implementation, and I don't think that it contradicts the
i_version definition in the man page patch.

> If however there are offline analysis tools that require atime updates
> to be counted, then we won't be able to do this. If that's the case, how
> can we fix this such that serving xfs via NFSv4 doesn't suck?
>

If I read the arguments correctly, implicit atime updates could be relaxed
as long as this behavior is clearly documented and coherent across all
implementations.

Forensics and other applications that care about atime updates can and
should check atime itself; they don't need i_version to know that it changed.
The reliability of atime as an audit tool has dropped considerably since
relatime became the default.
If we want to be paranoid, maybe we can keep incrementing i_version on
atime updates when the user has opted in to strict '-o atime' updates, but
IMO, there is no need for that.
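
As an aside, checking atime directly is cheap; a minimal userspace sketch
using plain statx(2) (nothing here depends on this series):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>          /* AT_FDCWD */
    #include <sys/stat.h>       /* statx(), glibc >= 2.28 */

    int main(int argc, char **argv)
    {
            struct statx stx;

            if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_ATIME, &stx))
                    return 1;
            /* compare against a stored baseline to detect accesses */
            printf("atime: %lld.%09u\n",
                   (long long)stx.stx_atime.tv_sec, stx.stx_atime.tv_nsec);
            return 0;
    }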

Thanks,
Amir.
Amir Goldstein Aug. 27, 2022, 8:01 a.m. UTC | #2
On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > xfs will update the i_version when updating only the atime value, which
> > is not desirable for any of the current consumers of i_version. Doing so
> > leads to unnecessary cache invalidations on NFS and extra measurement
> > activity in IMA.
> >
> > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> > transaction should not update the i_version. Set that value in
> > xfs_vn_update_time if we're only updating the atime.
> >
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: David Wysochanski <dwysocha@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> >  3 files changed, 11 insertions(+), 4 deletions(-)
> >
> > Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> > the problem. I still think this approach should at least fix the worst
> > problems with atime updates being counted. We can look to carve out
> > other "spurious" i_version updates as we identify them.
> >
>
> AFAIK, "spurious" is only inode blocks map changes due to writeback
> of dirty pages. Anybody know about other cases?
>
> Regarding inode blocks map changes, first of all, I don't think that there is
> any practical loss from invalidating NFS client cache on dirty data writeback,
> because NFS server should be serving cold data most of the time.
> If there are a few unneeded cache invalidations they would only be temporary.
>

Unless there is an issue with a writer NFS client that invalidates its own
attribute caches on server data writeback?

> One may even consider if NFSv4 server should not flush dirty data of an inode
> before granting a read lease to client.
> After all, if read lease was granted, client cached data and then server crashed
> before persisting the dirty data, then client will have cached a
> "future" version
> of the data and if i_version on the server did not roll back in that situation,
> we are looking at possible data corruptions.
>
> Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
> Storing in xattr a data checksum for data that is not persistent on disk
> would be an odd choice.
>
> So in my view, I only see benefits to current i_version users in the xfs
> i_version implementations and I don't think that it contradicts the
> i_version definition in the man page patch.
>
> > If however there are offline analysis tools that require atime updates
> > to be counted, then we won't be able to do this. If that's the case, how
> > can we fix this such that serving xfs via NFSv4 doesn't suck?
> >
>
> If I read the arguments correctly, implicit atime updates could be relaxed
> as long as this behavior is clearly documented and coherent on all
> implementations.
>
> Forensics and other applications that care about atime updates can and
> should check atime and don't need i_version to know that it was changed.
> The reliability of atime as an audit tool has dropped considerably since
> relatime became the default.
> If we want to be paranoid, maybe we can leave i_version increment on
> atime updates in case the user opted-in to strict '-o atime' updates, but
> IMO, there is no need for that.
>
> Thanks,
> Amir.
Jeff Layton Aug. 27, 2022, 1:14 p.m. UTC | #3
On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > 
> > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > xfs will update the i_version when updating only the atime value, which
> > > is not desirable for any of the current consumers of i_version. Doing so
> > > leads to unnecessary cache invalidations on NFS and extra measurement
> > > activity in IMA.
> > > 
> > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> > > transaction should not update the i_version. Set that value in
> > > xfs_vn_update_time if we're only updating the atime.
> > > 
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > 
> > > Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> > > the problem. I still think this approach should at least fix the worst
> > > problems with atime updates being counted. We can look to carve out
> > > other "spurious" i_version updates as we identify them.
> > > 
> > 
> > AFAIK, "spurious" is only inode blocks map changes due to writeback
> > of dirty pages. Anybody know about other cases?
> > 
> > Regarding inode blocks map changes, first of all, I don't think that there is
> > any practical loss from invalidating NFS client cache on dirty data writeback,
> > because NFS server should be serving cold data most of the time.
> > If there are a few unneeded cache invalidations they would only be temporary.
> > 
> 
> Unless there is an issue with a writer NFS client that invalidates its
> own attribute
> caches on server data writeback?
> 

The client just looks at the file attributes (of which i_version is but
one), and if certain attributes have changed (mtime, ctime, i_version,
etc...) then it invalidates its cache.

In the case of blocks map changes, could that mean a difference in the
observable sparse regions of the file? If so, then a READ_PLUS before
the change and a READ_PLUS after could give different results. Since
that difference is observable by the client, I'd think we'd want to bump
i_version for that anyway.
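
Whether sparseness is client-observable is exactly what SEEK_HOLE/SEEK_DATA
expose; a quick local illustration (a sketch, not NFS-specific):

    #define _GNU_SOURCE         /* SEEK_HOLE/SEEK_DATA */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd;

            if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return 1;
            /* run this before and after server-side writeback: if these
             * offsets move, the layout change was observable */
            printf("first data: %lld, first hole: %lld\n",
                   (long long)lseek(fd, 0, SEEK_DATA),
                   (long long)lseek(fd, 0, SEEK_HOLE));
            return 0;
    }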

> > One may even consider if NFSv4 server should not flush dirty data of an inode
> > before granting a read lease to client.
> > After all, if read lease was granted, client cached data and then server crashed
> > before persisting the dirty data, then client will have cached a
> > "future" version
> > of the data and if i_version on the server did not roll back in that situation,
> > we are looking at possible data corruptions.
> > 

We don't hand out read leases if there are file descriptions open for
write. NFS clients usually issue a COMMIT before closing a stateid in
order to satisfy close-to-open cache coherency.

So in most cases, this is probably not an issue. It might still be
worthwhile to make sure of it by doing a filemap_write_and_wait before
we hand out a delegation, but that's likely to be a no-op in most cases
anyway.
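
That would be a small addition on the server side; a sketch, with the exact
call site in nfsd's delegation setup assumed:

    /* flush dirty data before offering a read delegation so that
     * i_version (and the block map) has settled; hedged sketch only */
    if (filemap_write_and_wait(inode->i_mapping) != 0)
            return;         /* skip the delegation if writeback failed */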

Note too that the client will still revalidate its caches when it
receives attributes even when it holds a read delegation. In fact, this
behavior mostly papered over a rather nasty knfsd bug we found recently
where it was allowing conflicting activity to proceed even when there
was a read delegation outstanding.
 
> > Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
> > Storing in xattr a data checksum for data that is not persistent on disk
> > would be an odd choice.
> > 
> > So in my view, I only see benefits to current i_version users in the xfs
> > i_version implementations and I don't think that it contradicts the
> > i_version definition in the man page patch.
> > 
> > > If however there are offline analysis tools that require atime updates
> > > to be counted, then we won't be able to do this. If that's the case, how
> > > can we fix this such that serving xfs via NFSv4 doesn't suck?
> > > 
> > 
> > If I read the arguments correctly, implicit atime updates could be relaxed
> > as long as this behavior is clearly documented and coherent on all
> > implementations.
> > 
> > Forensics and other applications that care about atime updates can and
> > should check atime and don't need i_version to know that it was changed.
> > The reliability of atime as an audit tool has dropped considerably since
> > relatime became the default.
> > If we want to be paranoid, maybe we can leave i_version increment on
> > atime updates in case the user opted-in to strict '-o atime' updates, but
> > IMO, there is no need for that.
> > 

Thanks,
Darrick J. Wong Aug. 27, 2022, 3:46 p.m. UTC | #4
On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > 
> > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > xfs will update the i_version when updating only the atime value, which
> > > > is not desirable for any of the current consumers of i_version. Doing so
> > > > leads to unnecessary cache invalidations on NFS and extra measurement
> > > > activity in IMA.
> > > > 
> > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> > > > transaction should not update the i_version. Set that value in
> > > > xfs_vn_update_time if we're only updating the atime.
> > > > 
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > 
> > > > Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> > > > the problem. I still think this approach should at least fix the worst
> > > > problems with atime updates being counted. We can look to carve out
> > > > other "spurious" i_version updates as we identify them.
> > > > 
> > > 
> > > AFAIK, "spurious" is only inode blocks map changes due to writeback
> > > of dirty pages. Anybody know about other cases?
> > > 
> > > Regarding inode blocks map changes, first of all, I don't think that there is
> > > any practical loss from invalidating NFS client cache on dirty data writeback,
> > > because NFS server should be serving cold data most of the time.
> > > If there are a few unneeded cache invalidations they would only be temporary.
> > > 
> > 
> > Unless there is an issue with a writer NFS client that invalidates its
> > own attribute
> > caches on server data writeback?
> > 
> 
> The client just looks at the file attributes (of which i_version is but
> one), and if certain attributes have changed (mtime, ctime, i_version,
> etc...) then it invalidates its cache.
> 
> In the case of blocks map changes, could that mean a difference in the
> observable sparse regions of the file? If so, then a READ_PLUS before
> the change and a READ_PLUS after could give different results. Since
> that difference is observable by the client, I'd think we'd want to bump
> i_version for that anyway.

How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
that's been the subject of recent debate.  At least as far as XFS is
concerned, a file range can go from hole -> delayed allocation
reservation -> unwritten extent -> (actual writeback) -> written extent.
The dance became rather more complex when we added COW.  If any of that
will make a difference for READ_PLUS, then yes, I think you'd want file
writeback activities to bump iversion to cause client invalidations,
like (I think) Dave said.

The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
written and delalloc extents; and an unwritten extent will report data
for any pagecache it finds.

> > > One may even consider if NFSv4 server should not flush dirty data of an inode
> > > before granting a read lease to client.
> > > After all, if read lease was granted, client cached data and then server crashed
> > > before persisting the dirty data, then client will have cached a
> > > "future" version
> > > of the data and if i_version on the server did not roll back in that situation,
> > > we are looking at possible data corruptions.
> > > 
> 
> We don't hand out read leases if there are file descriptions open for
> write. NFS clients usually issue a COMMIT before closing a stateid in
> order to satisfy close-to-open cache coherency.
> 
> So in most cases, this is probably not an issue. It might still be
> worthwhile to make sure of it by doing a filemap_write_and_wait before
> we hand out a delegation, but that's likely to be a no-op in most cases
> anyway.
> 
> Note too that the client will still revalidate its caches when it
> receives attributes even when it holds a read delegation. In fact, this
> behavior mostly papered over a rather nasty knfsd bug we found recently
> where it was allowing conflicting activity to proceed even when there
> was a read delegation outstanding.
>  
> > > Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
> > > Storing in xattr a data checksum for data that is not persistent on disk
> > > would be an odd choice.
> > > 
> > > So in my view, I only see benefits to current i_version users in the xfs
> > > i_version implementations and I don't think that it contradicts the
> > > i_version definition in the man page patch.
> > > 
> > > > If however there are offline analysis tools that require atime updates
> > > > to be counted, then we won't be able to do this. If that's the case, how
> > > > can we fix this such that serving xfs via NFSv4 doesn't suck?
> > > > 
> > > 
> > > If I read the arguments correctly, implicit atime updates could be relaxed
> > > as long as this behavior is clearly documented and coherent on all
> > > implementations.
> > > 
> > > Forensics and other applications that care about atime updates can and
> > > should check atime and don't need i_version to know that it was changed.
> > > The reliability of atime as an audit tool has dropped considerably since
> > > relatime became the default.

I've been waiting for Amir to appear in this discussion -- ISTR that a
few years ago you were wanting the ability to scan a filesystem to look
for files that have changed since a given point.  If XFS exported its
di_changecount file attribute (as it currently behaves) via BULKSTAT,
you'd have the ability to do that, so long as your application could
persist bulkstat data and compare.
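
A sketch of what such a scan could look like. Note that struct xfs_bulkstat
has no change-counter field today, so bs_changecount below is hypothetical;
header paths are from xfsprogs and may vary:

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <xfs/xfs.h>
    #include <xfs/xfs_fs.h>     /* XFS_IOC_BULKSTAT, struct xfs_bulkstat_req */

    #define NBATCH 64

    /* fsfd: an fd on any file in the filesystem to scan */
    static void scan(int fsfd)
    {
            struct xfs_bulkstat_req *req;

            req = calloc(1, sizeof(*req) + NBATCH * sizeof(struct xfs_bulkstat));
            req->hdr.icount = NBATCH;

            /* the kernel advances hdr.ino, so repeated calls walk all inodes */
            while (ioctl(fsfd, XFS_IOC_BULKSTAT, req) == 0 && req->hdr.ocount) {
                    for (unsigned int i = 0; i < req->hdr.ocount; i++) {
                            struct xfs_bulkstat *bs = &req->bulkstat[i];
                            /* diff bs->bs_changecount (hypothetical field)
                             * against the value persisted at the last scan */
                            (void)bs;
                    }
            }
            free(req);
    }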

--D

> > > If we want to be paranoid, maybe we can leave i_version increment on
> > > atime updates in case the user opted-in to strict '-o atime' updates, but
> > > IMO, there is no need for that.
> > > 
> 
> Thanks,
> -- 
> Jeff Layton <jlayton@kernel.org>
Trond Myklebust Aug. 27, 2022, 4:03 p.m. UTC | #5
On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > <amir73il@gmail.com> wrote:
> > > > 
> > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > <jlayton@kernel.org> wrote:
> > > > > 
> > > > > xfs will update the i_version when updating only the atime
> > > > > value, which
> > > > > is not desirable for any of the current consumers of
> > > > > i_version. Doing so
> > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > measurement
> > > > > activity in IMA.
> > > > > 
> > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > the
> > > > > transaction should not update the i_version. Set that value
> > > > > in
> > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > 
> > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > 
> > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > illustrate
> > > > > the problem. I still think this approach should at least fix
> > > > > the worst
> > > > > problems with atime updates being counted. We can look to
> > > > > carve out
> > > > > other "spurious" i_version updates as we identify them.
> > > > > 
> > > > 
> > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > writeback
> > > > of dirty pages. Anybody know about other cases?
> > > > 
> > > > Regarding inode blocks map changes, first of all, I don't think
> > > > that there is
> > > > any practical loss from invalidating NFS client cache on dirty
> > > > data writeback,
> > > > because NFS server should be serving cold data most of the
> > > > time.
> > > > If there are a few unneeded cache invalidations they would only
> > > > be temporary.
> > > > 
> > > 
> > > Unless there is an issue with a writer NFS client that
> > > invalidates its
> > > own attribute
> > > caches on server data writeback?
> > > 
> > 
> > The client just looks at the file attributes (of which i_version is
> > but
> > one), and if certain attributes have changed (mtime, ctime,
> > i_version,
> > etc...) then it invalidates its cache.
> > 
> > In the case of blocks map changes, could that mean a difference in
> > the
> > observable sparse regions of the file? If so, then a READ_PLUS
> > before
> > the change and a READ_PLUS after could give different results.
> > Since
> > that difference is observable by the client, I'd think we'd want to
> > bump
> > i_version for that anyway.
> 
> How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> that's been the subject of recent debate.  At least as far as XFS is
> concerned, a file range can go from hole -> delayed allocation
> reservation -> unwritten extent -> (actual writeback) -> written
> extent.
> The dance became rather more complex when we added COW.  If any of
> that
> will make a difference for READ_PLUS, then yes, I think you'd want
> file
> writeback activities to bump iversion to cause client invalidations,
> like (I think) Dave said.
> 
> The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> written and delalloc extents; and an unwritten extent will report
> data
> for any pagecache it finds.
> 

READ_PLUS should never return anything different from what a read() system
call would return for any given area. The way it reports sparse regions
vs. data regions is purely an RPC formatting convenience.

The only point to note about NFS READ and READ_PLUS is that because the
client is forced to send multiple RPC calls if the user is trying to
read a region that is larger than the 'rsize' value, it is possible
that these READ/READ_PLUS calls may be processed out of order, and so
the result may end up looking different than if you had executed a
read() call for the full region directly on the server.
However each individual READ / READ_PLUS reply should look as if the
user had called read() on that rsize-sized section of the file.
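
To spell that out: with rsize = 256 KiB, a 1 MiB read becomes four
independent RPCs. A sketch of the split (the helper name is made up):

    /* each reply must be self-consistent for its own slice, but the
     * server may process the calls in any order */
    for (off_t off = 0; off < len; off += rsize) {
            size_t count = (len - off < rsize) ? len - off : rsize;
            nfs_send_read_rpc(fd, off, count);      /* hypothetical helper */
    }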
> > >
Jeff Layton Aug. 27, 2022, 4:10 p.m. UTC | #6
On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > <amir73il@gmail.com> wrote:
> > > > > 
> > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > <jlayton@kernel.org> wrote:
> > > > > > 
> > > > > > xfs will update the i_version when updating only the atime
> > > > > > value, which
> > > > > > is not desirable for any of the current consumers of
> > > > > > i_version. Doing so
> > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > measurement
> > > > > > activity in IMA.
> > > > > > 
> > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > the
> > > > > > transaction should not update the i_version. Set that value
> > > > > > in
> > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > 
> > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > ---
> > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > illustrate
> > > > > > the problem. I still think this approach should at least fix
> > > > > > the worst
> > > > > > problems with atime updates being counted. We can look to
> > > > > > carve out
> > > > > > other "spurious" i_version updates as we identify them.
> > > > > > 
> > > > > 
> > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > writeback
> > > > > of dirty pages. Anybody know about other cases?
> > > > > 
> > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > that there is
> > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > data writeback,
> > > > > because NFS server should be serving cold data most of the
> > > > > time.
> > > > > If there are a few unneeded cache invalidations they would only
> > > > > be temporary.
> > > > > 
> > > > 
> > > > Unless there is an issue with a writer NFS client that
> > > > invalidates its
> > > > own attribute
> > > > caches on server data writeback?
> > > > 
> > > 
> > > The client just looks at the file attributes (of which i_version is
> > > but
> > > one), and if certain attributes have changed (mtime, ctime,
> > > i_version,
> > > etc...) then it invalidates its cache.
> > > 
> > > In the case of blocks map changes, could that mean a difference in
> > > the
> > > observable sparse regions of the file? If so, then a READ_PLUS
> > > before
> > > the change and a READ_PLUS after could give different results.
> > > Since
> > > that difference is observable by the client, I'd think we'd want to
> > > bump
> > > i_version for that anyway.
> > 
> > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > that's been the subject of recent debate.  At least as far as XFS is
> > concerned, a file range can go from hole -> delayed allocation
> > reservation -> unwritten extent -> (actual writeback) -> written
> > extent.
> > The dance became rather more complex when we added COW.  If any of
> > that
> > will make a difference for READ_PLUS, then yes, I think you'd want
> > file
> > writeback activities to bump iversion to cause client invalidations,
> > like (I think) Dave said.
> > 
> > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > written and delalloc extents; and an unwritten extent will report
> > data
> > for any pagecache it finds.
> > 
> 
> READ_PLUS should never return anything different than a read() system
> call would return for any given area. The way it reports sparse regions
> vs. data regions is purely an RPC formatting convenience.
> 
> The only point to note about NFS READ and READ_PLUS is that because the
> client is forced to send multiple RPC calls if the user is trying to
> read a region that is larger than the 'rsize' value, it is possible
> that these READ/READ_PLUS calls may be processed out of order, and so
> the result may end up looking different than if you had executed a
> read() call for the full region directly on the server.
> However each individual READ / READ_PLUS reply should look as if the
> user had called read() on that rsize-sized section of the file.
> > > 

Yeah, thinking about it some more, simply changing the block allocation
is not something that should affect the ctime, so we probably don't want
to bump i_version on it. It's an implicit change, IOW, not an explicit
one.

The fact that xfs might do that is unfortunate, but it's not the end of
the world and it still would conform to the proposed definition for
i_version. In practice, this sort of allocation change should come soon
after the file was written, so one would hope that any damage due to the
false i_version bump would be minimized.

It would be nice to teach it not to do that, however. Maybe we can insert
the NOIVER flag at a strategic place to avoid it?
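
One conceivable "strategic place", assuming the bump comes from logging the
inode when unwritten extents are converted at writeback completion (the call
site here is an assumption):

    /* log the inode for the post-writeback extent conversion without
     * bumping i_version */
    xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE | XFS_ILOG_NOIVER);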
Trond Myklebust Aug. 27, 2022, 5:06 p.m. UTC | #7
On Sat, 2022-08-27 at 12:10 -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > <amir73il@gmail.com> wrote:
> > > > > > 
> > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > <jlayton@kernel.org> wrote:
> > > > > > > 
> > > > > > > xfs will update the i_version when updating only the
> > > > > > > atime
> > > > > > > value, which
> > > > > > > is not desirable for any of the current consumers of
> > > > > > > i_version. Doing so
> > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > measurement
> > > > > > > activity in IMA.
> > > > > > > 
> > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate
> > > > > > > that
> > > > > > > the
> > > > > > > transaction should not update the i_version. Set that
> > > > > > > value
> > > > > > > in
> > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > > 
> > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way
> > > > > > > to
> > > > > > > illustrate
> > > > > > > the problem. I still think this approach should at least
> > > > > > > fix
> > > > > > > the worst
> > > > > > > problems with atime updates being counted. We can look to
> > > > > > > carve out
> > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > > 
> > > > > > 
> > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > writeback
> > > > > > of dirty pages. Anybody know about other cases?
> > > > > > 
> > > > > > Regarding inode blocks map changes, first of all, I don't
> > > > > > think
> > > > > > that there is
> > > > > > any practical loss from invalidating NFS client cache on
> > > > > > dirty
> > > > > > data writeback,
> > > > > > because NFS server should be serving cold data most of the
> > > > > > time.
> > > > > > If there are a few unneeded cache invalidations they would
> > > > > > only
> > > > > > be temporary.
> > > > > > 
> > > > > 
> > > > > Unless there is an issue with a writer NFS client that
> > > > > invalidates its
> > > > > own attribute
> > > > > caches on server data writeback?
> > > > > 
> > > > 
> > > > The client just looks at the file attributes (of which
> > > > i_version is
> > > > but
> > > > one), and if certain attributes have changed (mtime, ctime,
> > > > i_version,
> > > > etc...) then it invalidates its cache.
> > > > 
> > > > In the case of blocks map changes, could that mean a difference
> > > > in
> > > > the
> > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > before
> > > > the change and a READ_PLUS after could give different results.
> > > > Since
> > > > that difference is observable by the client, I'd think we'd
> > > > want to
> > > > bump
> > > > i_version for that anyway.
> > > 
> > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I
> > > know
> > > that's been the subject of recent debate.  At least as far as XFS
> > > is
> > > concerned, a file range can go from hole -> delayed allocation
> > > reservation -> unwritten extent -> (actual writeback) -> written
> > > extent.
> > > The dance became rather more complex when we added COW.  If any
> > > of
> > > that
> > > will make a difference for READ_PLUS, then yes, I think you'd
> > > want
> > > file
> > > writeback activities to bump iversion to cause client
> > > invalidations,
> > > like (I think) Dave said.
> > > 
> > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data
> > > for
> > > written and delalloc extents; and an unwritten extent will report
> > > data
> > > for any pagecache it finds.
> > > 
> > 
> > READ_PLUS should never return anything different than a read()
> > system
> > call would return for any given area. The way it reports sparse
> > regions
> > vs. data regions is purely an RPC formatting convenience.
> > 
> > The only point to note about NFS READ and READ_PLUS is that because
> > the
> > client is forced to send multiple RPC calls if the user is trying
> > to
> > read a region that is larger than the 'rsize' value, it is possible
> > that these READ/READ_PLUS calls may be processed out of order, and
> > so
> > the result may end up looking different than if you had executed a
> > read() call for the full region directly on the server.
> > However each individual READ / READ_PLUS reply should look as if
> > the
> > user had called read() on that rsize-sized section of the file.
> > > > 
> 
> Yeah, thinking about it some more, simply changing the block
> allocation
> is not something that should affect the ctime, so we probably don't
> want
> to bump i_version on it. It's an implicit change, IOW, not an
> explicit
> one.

As you say, it is unfortunate that XFS does this, and it is unfortunate
that it then changes the blocks-allocated attribute post-fsync(), since
all that does is cause confusion for certain applications.
However, I agree 100% that this is an implicit change that is driven by
the filesystem and not the user application. Hence it is not an action
that needs to be recorded with a change attribute bump.
Amir Goldstein Aug. 28, 2022, 1:25 p.m. UTC | #8
On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > <amir73il@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > <jlayton@kernel.org> wrote:
> > > > > > >
> > > > > > > xfs will update the i_version when updating only the atime
> > > > > > > value, which
> > > > > > > is not desirable for any of the current consumers of
> > > > > > > i_version. Doing so
> > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > measurement
> > > > > > > activity in IMA.
> > > > > > >
> > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > > the
> > > > > > > transaction should not update the i_version. Set that value
> > > > > > > in
> > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > >
> > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > > illustrate
> > > > > > > the problem. I still think this approach should at least fix
> > > > > > > the worst
> > > > > > > problems with atime updates being counted. We can look to
> > > > > > > carve out
> > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > >
> > > > > >
> > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > writeback
> > > > > > of dirty pages. Anybody know about other cases?
> > > > > >
> > > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > > that there is
> > > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > > data writeback,
> > > > > > because NFS server should be serving cold data most of the
> > > > > > time.
> > > > > > If there are a few unneeded cache invalidations they would only
> > > > > > be temporary.
> > > > > >
> > > > >
> > > > > Unless there is an issue with a writer NFS client that
> > > > > invalidates its
> > > > > own attribute
> > > > > caches on server data writeback?
> > > > >
> > > >
> > > > The client just looks at the file attributes (of which i_version is
> > > > but
> > > > one), and if certain attributes have changed (mtime, ctime,
> > > > i_version,
> > > > etc...) then it invalidates its cache.
> > > >
> > > > In the case of blocks map changes, could that mean a difference in
> > > > the
> > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > before
> > > > the change and a READ_PLUS after could give different results.
> > > > Since
> > > > that difference is observable by the client, I'd think we'd want to
> > > > bump
> > > > i_version for that anyway.
> > >
> > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > > that's been the subject of recent debate.  At least as far as XFS is
> > > concerned, a file range can go from hole -> delayed allocation
> > > reservation -> unwritten extent -> (actual writeback) -> written
> > > extent.
> > > The dance became rather more complex when we added COW.  If any of
> > > that
> > > will make a difference for READ_PLUS, then yes, I think you'd want
> > > file
> > > writeback activities to bump iversion to cause client invalidations,
> > > like (I think) Dave said.
> > >
> > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > > written and delalloc extents; and an unwritten extent will report
> > > data
> > > for any pagecache it finds.
> > >
> >
> > READ_PLUS should never return anything different than a read() system
> > call would return for any given area. The way it reports sparse regions
> > vs. data regions is purely an RPC formatting convenience.
> >
> > The only point to note about NFS READ and READ_PLUS is that because the
> > client is forced to send multiple RPC calls if the user is trying to
> > read a region that is larger than the 'rsize' value, it is possible
> > that these READ/READ_PLUS calls may be processed out of order, and so
> > the result may end up looking different than if you had executed a
> > read() call for the full region directly on the server.
> > However each individual READ / READ_PLUS reply should look as if the
> > user had called read() on that rsize-sized section of the file.
> > > >
>
> Yeah, thinking about it some more, simply changing the block allocation
> is not something that should affect the ctime, so we probably don't want
> to bump i_version on it. It's an implicit change, IOW, not an explicit
> one.
>
> The fact that xfs might do that is unfortunate, but it's not the end of
> the world and it still would conform to the proposed definition for
> i_version. In practice, this sort of allocation change should come soon
> after the file was written, so one would hope that any damage due to the
> false i_version bump would be minimized.
>

That was exactly my point.

> It would be nice to teach it not to do that however. Maybe we can insert
> the NOIVER flag at a strategic place to avoid it?

Why would that be nice to avoid?
You did not specify any use case where incrementing i_version
on block mapping changes matters in practice.
On the contrary, you said that an NFS client writer sends COMMIT on close,
which should stabilize i_version for the next readers.

Given that we already have an xfs implementation that does increment
i_version on block mapping changes, and that it would be a pain to change
that or add a new user option, I don't see the point in discussing this
further unless there is a good incentive for avoiding i_version updates in
those cases.

Thanks,
Amir.
Jeff Layton Aug. 28, 2022, 2:37 p.m. UTC | #9
On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > > <amir73il@gmail.com> wrote:
> > > > > > > 
> > > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > > <jlayton@kernel.org> wrote:
> > > > > > > > 
> > > > > > > > xfs will update the i_version when updating only the atime
> > > > > > > > value, which
> > > > > > > > is not desirable for any of the current consumers of
> > > > > > > > i_version. Doing so
> > > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > > measurement
> > > > > > > > activity in IMA.
> > > > > > > > 
> > > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > > > the
> > > > > > > > transaction should not update the i_version. Set that value
> > > > > > > > in
> > > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > > > 
> > > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > > ---
> > > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > > > illustrate
> > > > > > > > the problem. I still think this approach should at least fix
> > > > > > > > the worst
> > > > > > > > problems with atime updates being counted. We can look to
> > > > > > > > carve out
> > > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > > > 
> > > > > > > 
> > > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > > writeback
> > > > > > > of dirty pages. Anybody know about other cases?
> > > > > > > 
> > > > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > > > that there is
> > > > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > > > data writeback,
> > > > > > > because NFS server should be serving cold data most of the
> > > > > > > time.
> > > > > > > If there are a few unneeded cache invalidations they would only
> > > > > > > be temporary.
> > > > > > > 
> > > > > > 
> > > > > > Unless there is an issue with a writer NFS client that
> > > > > > invalidates its
> > > > > > own attribute
> > > > > > caches on server data writeback?
> > > > > > 
> > > > > 
> > > > > The client just looks at the file attributes (of which i_version is
> > > > > but
> > > > > one), and if certain attributes have changed (mtime, ctime,
> > > > > i_version,
> > > > > etc...) then it invalidates its cache.
> > > > > 
> > > > > In the case of blocks map changes, could that mean a difference in
> > > > > the
> > > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > > before
> > > > > the change and a READ_PLUS after could give different results.
> > > > > Since
> > > > > that difference is observable by the client, I'd think we'd want to
> > > > > bump
> > > > > i_version for that anyway.
> > > > 
> > > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > > > that's been the subject of recent debate.  At least as far as XFS is
> > > > concerned, a file range can go from hole -> delayed allocation
> > > > reservation -> unwritten extent -> (actual writeback) -> written
> > > > extent.
> > > > The dance became rather more complex when we added COW.  If any of
> > > > that
> > > > will make a difference for READ_PLUS, then yes, I think you'd want
> > > > file
> > > > writeback activities to bump iversion to cause client invalidations,
> > > > like (I think) Dave said.
> > > > 
> > > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > > > written and delalloc extents; and an unwritten extent will report
> > > > data
> > > > for any pagecache it finds.
> > > > 
> > > 
> > > READ_PLUS should never return anything different than a read() system
> > > call would return for any given area. The way it reports sparse regions
> > > vs. data regions is purely an RPC formatting convenience.
> > > 
> > > The only point to note about NFS READ and READ_PLUS is that because the
> > > client is forced to send multiple RPC calls if the user is trying to
> > > read a region that is larger than the 'rsize' value, it is possible
> > > that these READ/READ_PLUS calls may be processed out of order, and so
> > > the result may end up looking different than if you had executed a
> > > read() call for the full region directly on the server.
> > > However each individual READ / READ_PLUS reply should look as if the
> > > user had called read() on that rsize-sized section of the file.
> > > > > 
> > 
> > Yeah, thinking about it some more, simply changing the block allocation
> > is not something that should affect the ctime, so we probably don't want
> > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > one.
> > 
> > The fact that xfs might do that is unfortunate, but it's not the end of
> > the world and it still would conform to the proposed definition for
> > i_version. In practice, this sort of allocation change should come soon
> > after the file was written, so one would hope that any damage due to the
> > false i_version bump would be minimized.
> > 
> 
> That was exactly my point.
> 
> > It would be nice to teach it not to do that however. Maybe we can insert
> > the NOIVER flag at a strategic place to avoid it?
> 
> Why would that be nice to avoid?
> You did not specify any use case where incrementing i_version
> on block mapping change matters in practice.
> On the contrary, you said that NFS client writer sends COMMIT on close,
> which should stabilize i_version for the next readers.
> 
> Given that we already have an xfs implementation that does increment
> i_version on block mapping changes and it would be a pain to change
> that or add a new user options, I don't see the point in discussing it further
> unless there is a good incentive for avoiding i_version updates in those cases.
> 

Because the change to the block allocation doesn't represent an
"explicit" change to the inode. We will have bumped the ctime on the
original write (in update_time), but the follow-on changes that occur
due to that write needn't be counted as they aren't visible to the
client.

It's possible for a client to issue a read between the write and the
flush and get the interim value for i_version. Then, once the writeback
happens and the i_version gets bumped again, the client invalidates its
cache even though it needn't do so.

The race window ought to be relatively small, and this wouldn't result
in incorrect behavior that you'd notice (other than loss of
performance), but it's not ideal. We're doing more on-the-wire reads
than are necessary in this case.
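
To spell out the sequence:

1. the client WRITEs; the server bumps ctime and i_version for the explicit change;
2. the client GETATTRs and caches the interim i_version;
3. server-side writeback allocates blocks; XFS logs the inode and bumps i_version again;
4. the client sees the new i_version and needlessly invalidates its cache.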

It would be nice to have it not do that. If we end up taking this patch
to make it elide the i_version bumps on atime updates, we may be able to
set the NOIVER flag in other cases as well, and avoid some of these
extra bumps.
Amir Goldstein Aug. 28, 2022, 4:53 p.m. UTC | #10
On Sun, Aug 28, 2022 at 5:37 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > > > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > > > <amir73il@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > > > <jlayton@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > xfs will update the i_version when updating only the atime
> > > > > > > > > value, which
> > > > > > > > > is not desirable for any of the current consumers of
> > > > > > > > > i_version. Doing so
> > > > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > > > measurement
> > > > > > > > > activity in IMA.
> > > > > > > > >
> > > > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > > > > the
> > > > > > > > > transaction should not update the i_version. Set that value
> > > > > > > > > in
> > > > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > > > >
> > > > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > > > ---
> > > > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > >
> > > > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > > > > illustrate
> > > > > > > > > the problem. I still think this approach should at least fix
> > > > > > > > > the worst
> > > > > > > > > problems with atime updates being counted. We can look to
> > > > > > > > > carve out
> > > > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > > > >
> > > > > > > >
> > > > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > > > writeback
> > > > > > > > of dirty pages. Anybody know about other cases?
> > > > > > > >
> > > > > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > > > > that there is
> > > > > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > > > > data writeback,
> > > > > > > > because NFS server should be serving cold data most of the
> > > > > > > > time.
> > > > > > > > If there are a few unneeded cache invalidations they would only
> > > > > > > > be temporary.
> > > > > > > >
> > > > > > >
> > > > > > > Unless there is an issue with a writer NFS client that
> > > > > > > invalidates its
> > > > > > > own attribute
> > > > > > > caches on server data writeback?
> > > > > > >
> > > > > >
> > > > > > The client just looks at the file attributes (of which i_version is
> > > > > > but
> > > > > > one), and if certain attributes have changed (mtime, ctime,
> > > > > > i_version,
> > > > > > etc...) then it invalidates its cache.
> > > > > >
> > > > > > In the case of blocks map changes, could that mean a difference in
> > > > > > the
> > > > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > > > before
> > > > > > the change and a READ_PLUS after could give different results.
> > > > > > Since
> > > > > > that difference is observable by the client, I'd think we'd want to
> > > > > > bump
> > > > > > i_version for that anyway.
> > > > >
> > > > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > > > > that's been the subject of recent debate.  At least as far as XFS is
> > > > > concerned, a file range can go from hole -> delayed allocation
> > > > > reservation -> unwritten extent -> (actual writeback) -> written
> > > > > extent.
> > > > > The dance became rather more complex when we added COW.  If any of
> > > > > that
> > > > > will make a difference for READ_PLUS, then yes, I think you'd want
> > > > > file
> > > > > writeback activities to bump iversion to cause client invalidations,
> > > > > like (I think) Dave said.
> > > > >
> > > > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > > > > written and delalloc extents; and an unwritten extent will report
> > > > > data
> > > > > for any pagecache it finds.
> > > > >
> > > >
> > > > READ_PLUS should never return anything different than a read() system
> > > > call would return for any given area. The way it reports sparse regions
> > > > vs. data regions is purely an RPC formatting convenience.
> > > >
> > > > The only point to note about NFS READ and READ_PLUS is that because the
> > > > client is forced to send multiple RPC calls if the user is trying to
> > > > read a region that is larger than the 'rsize' value, it is possible
> > > > that these READ/READ_PLUS calls may be processed out of order, and so
> > > > the result may end up looking different than if you had executed a
> > > > read() call for the full region directly on the server.
> > > > However each individual READ / READ_PLUS reply should look as if the
> > > > user had called read() on that rsize-sized section of the file.
> > > > > >
> > >
> > > Yeah, thinking about it some more, simply changing the block allocation
> > > is not something that should affect the ctime, so we probably don't want
> > > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > > one.
> > >
> > > The fact that xfs might do that is unfortunate, but it's not the end of
> > > the world and it still would conform to the proposed definition for
> > > i_version. In practice, this sort of allocation change should come soon
> > > after the file was written, so one would hope that any damage due to the
> > > false i_version bump would be minimized.
> > >
> >
> > That was exactly my point.
> >
> > > It would be nice to teach it not to do that however. Maybe we can insert
> > > the NOIVER flag at a strategic place to avoid it?
> >
> > Why would that be nice to avoid?
> > You did not specify any use case where incrementing i_version
> > on block mapping change matters in practice.
> > On the contrary, you said that NFS client writer sends COMMIT on close,
> > which should stabilize i_version for the next readers.
> >
> > Given that we already have an xfs implementation that does increment
> > i_version on block mapping changes and it would be a pain to change
> > that or add a new user options, I don't see the point in discussing it further
> > unless there is a good incentive for avoiding i_version updates in those cases.
> >
>
> Because the change to the block allocation doesn't represent an
> "explicit" change to the inode. We will have bumped the ctime on the
> original write (in update_time), but the follow-on changes that occur
> due to that write needn't be counted as they aren't visible to the
> client.
>
> It's possible for a client to issue a read between the write and the
> flush and get the interim value for i_version. Then, once the flush
> happens and the i_version gets bumped again, the client invalidates its
> cache even though it needn't do so.
>
> The race window ought to be relatively small, and this wouldn't result
> in incorrect behavior that you'd notice (other than loss of
> performance), but it's not ideal. We're doing more on-the-wire reads
> than are necessary in this case.
>
> It would be nice to have it not do that. If we end up taking this patch
> to make it elide the i_version bumps on atime updates, we may be able to
set the NOIVER flag in other cases as well, and avoid some of these
> extra bumps.

Ok. The use case is that a client's writes on an open file can end up
invalidating the same client's read cache on the same file.
Sounds like implementing write delegations would have been better
in this case than relying on i_version, but I do not know how hard that is
and how many clients would request write delegations.

Anyway, if you ever send a proposal to change the semantics of
xfs i_version bumping on writeback, please specify the target use case
and how likely it is for real workloads to suffer from it.

Another question to the NFS experts - what happens when the
server crashes before dirty data hits persistent storage but the client
has already cached the dirty data?

After the crash, the filesystem can come up with the same i_version
that the NFS client has cached, but without the dirty data.
Does the spec address this scenario?

Thanks,
Amir.
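
For reference, the fs/iomap SEEK_DATA/SEEK_HOLE behaviour discussed
above can be observed directly from userspace with plain lseek(). A
minimal sketch (path taken from argv, most error handling trimmed):

#define _GNU_SOURCE		/* for SEEK_DATA/SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDONLY);
	off_t pos = 0, end;

	if (fd < 0)
		return 1;
	end = lseek(fd, 0, SEEK_END);
	while (pos < end) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0)
			break;	/* ENXIO: only a hole remains to EOF */
		hole = lseek(fd, data, SEEK_HOLE);
		printf("data: [%lld, %lld)\n", (long long)data,
		       (long long)hole);
		pos = hole;
	}
	return 0;
}

Per Darrick's note, a delalloc extent already reports as data here, so
the observable data/hole map need not change merely because writeback
later converts that extent.
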
Amir Goldstein Aug. 28, 2022, 5:30 p.m. UTC | #11
> > > >
> > > > Forensics and other applications that care about atime updates can and
> > > > should check atime and don't need i_version to know that it was changed.
> > > > The reliability of atime as an audit tool has dropped considerably since
> > > > the default changed to relatime.
>
> I've been waiting for Amir to appear in this discussion -- ISTR that a
> few years ago you were wanting the ability to scan a filesystem to look
> for files that have changed since a given point.  If XFS exported its
> di_changecount file attribute (as it currently behaves) via BULKSTAT,
> you'd have the ability to do that, so long as your application could
> persist bulkstat data and compare.
>

It's true that exporting i_version via BULKSTAT could be useful
to some backup/sync applications.

For my case, I was interested in something a bit different -
finding all the changes in a very large fs tree - or IOW finding if
anything has changed inside a large tree since a point in time.
So I'm not sure i_version would help for that use case.
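
For reference, a minimal sketch of the bulkstat scan Darrick alluded
to, assuming the v1 XFS_IOC_FSBULKSTAT ioctl and the struct names from
xfsprogs' <xfs/xfs.h>. Today's struct xfs_bstat exposes no change
counter, so the comment only marks where an exported di_changecount
would be read:

#include <xfs/xfs.h>		/* XFS_IOC_FSBULKSTAT, struct xfs_bstat */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
	struct xfs_bstat buf[64];
	struct xfs_fsop_bulkreq req;
	__u64 last = 0;
	__s32 count = 0;
	int i, fd = open(argv[1], O_RDONLY);	/* e.g. the mount point */

	if (fd < 0)
		return 1;
	req.lastip = &last;
	req.icount = 64;
	req.ubuffer = buf;
	req.ocount = &count;
	while (ioctl(fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0)
		for (i = 0; i < count; i++)
			/* an exported di_changecount would be read here */
			printf("ino %llu ctime %ld\n",
			       (unsigned long long)buf[i].bs_ino,
			       (long)buf[i].bs_ctime.tv_sec);
	return 0;
}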

Thanks,
Amir.
Dave Chinner Aug. 29, 2022, 5:48 a.m. UTC | #12
On Sun, Aug 28, 2022 at 10:37:37AM -0400, Jeff Layton wrote:
> On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > Yeah, thinking about it some more, simply changing the block allocation
> > > is not something that should affect the ctime, so we probably don't want
> > > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > > one.
> > > 
> > > The fact that xfs might do that is unfortunate, but it's not the end of
> > > the world and it still would conform to the proposed definition for
> > > i_version. In practice, this sort of allocation change should come soon
> > > after the file was written, so one would hope that any damage due to the
> > > false i_version bump would be minimized.
> > > 
> > 
> > That was exactly my point.
> > 
> > > It would be nice to teach it not to do that however. Maybe we can insert
> > > the NOIVER flag at a strategic place to avoid it?

No, absolutely not.

I've already explained this: The XFS *disk format specification*
says that di_changecount is bumped for every change that is made to
the inode.

Applications that are written from this specification expect the on
disk format for a given XFS filesystem feature to remain the same
until it is either deprecated and removed or we add feature flags to
indicate it has different behaviour.  We can't just change the
behaviour at a whim.

And that's ignoring the fact that randomly spewing NOIVER
into transactions that modify inode metadata is a nasty hack - it
is not desirable from a design or documentation POV, nor is it
maintainable.

> > Why would that be nice to avoid?
> > You did not specify any use case where incrementing i_version
> > on block mapping change matters in practice.
> > On the contrary, you said that NFS client writer sends COMMIT on close,
> > which should stabilize i_version for the next readers.
> > 
> > Given that we already have an xfs implementation that does increment
> > i_version on block mapping changes and it would be a pain to change
> > that or add new user options, I don't see the point in discussing it further
> > unless there is a good incentive for avoiding i_version updates in those cases.
> > 
> 
> Because the change to the block allocation doesn't represent an
> "explicit" change to the inode. We will have bumped the ctime on the
> original write (in update_time), but the follow-on changes that occur
> due to that write needn't be counted as they aren't visible to the
> client.
> 
> It's possible for a client to issue a read between the write and the
> > flush and get the interim value for i_version. Then, once the flush
> happens and the i_version gets bumped again, the client invalidates its
> cache even though it needn't do so.
> 
> The race window ought to be relatively small, and this wouldn't result
> in incorrect behavior that you'd notice (other than loss of
> performance), but it's not ideal. We're doing more on-the-wire reads
> than are necessary in this case.
> 
> It would be nice to have it not do that. If we end up taking this patch
> to make it elide the i_version bumps on atime updates, we may be able to
> set the NOIVER flag in other cases as well, and avoid some of these
> extra bumps.


<sigh>

Please don't make me repeat myself for the third time.

Once we have decided on a solid, unchanging definition for the
*statx user API variable*, we'll implement a new on-disk field that
provides this information.  We will document it in the on-disk
specification as "this is how di_iversion behaves" so that it is
clear to everyone parsing the on-disk format or writing their own
XFS driver how to implement it and when to expect it to
change.

Then we can add filesystem and inode feature flags that say "inode
has new iversion" and we use that to populate the kernel iversion
instead of di_changecount. We keep di_changecount exactly the way it
is now for the applications and use cases we already have for that
specific behaviour. If the kernel and/or filesystem don't support
the new di_iversion field, then we'll use di_changecount as it
currently exists for the kernel iversion code.

Keep in mind that we've been doing dynamic inode format updates in
XFS for a couple of decades - users don't even have to be aware that
they need to perform format upgrades because often they just happen
whenever an inode is accessed. IOWs, just because we have to change
the on-disk format to support this new iversion definition, it
doesn't mean users have to reformat filesystems before the new
feature can be used.

Hence, over time, as distros update kernels, the XFS iversion
behaviour will change automagically as we update inodes in existing
filesystems as they are accessed to add and then use the new
di_iversion field for the VFS change attribute field instead of the
di_changecount field...

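To sketch the shape of that plan - purely hypothetical: di_iversion,
xfs_has_iversion() and the in-core field names below are made up to
illustrate the fallback, and only inode_set_iversion_queried() is the
existing VFS interface:

/* Hypothetical sketch only - none of the di_iversion names exist yet. */
static void
xfs_inode_set_change_attr(
	struct xfs_inode	*ip)
{
	struct inode		*inode = VFS_I(ip);

	if (xfs_has_iversion(ip->i_mount))
		/* new on-disk field, bumped only for explicit changes */
		inode_set_iversion_queried(inode, ip->i_iversion);
	else
		/* fall back to di_changecount, which bumps on every change */
		inode_set_iversion_queried(inode, ip->i_changecount);
}
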
-Dave.
Jeff Layton Aug. 29, 2022, 10:33 a.m. UTC | #13
On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> On Sun, Aug 28, 2022 at 10:37:37AM -0400, Jeff Layton wrote:
> > On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> > > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > Yeah, thinking about it some more, simply changing the block allocation
> > > > is not something that should affect the ctime, so we probably don't want
> > > > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > > > one.
> > > > 
> > > > The fact that xfs might do that is unfortunate, but it's not the end of
> > > > the world and it still would conform to the proposed definition for
> > > > i_version. In practice, this sort of allocation change should come soon
> > > > after the file was written, so one would hope that any damage due to the
> > > > false i_version bump would be minimized.
> > > > 
> > > 
> > > That was exactly my point.
> > > 
> > > > It would be nice to teach it not to do that however. Maybe we can insert
> > > > the NOIVER flag at a strategic place to avoid it?
> 
> No, absolutely not.
> 
> I've already explained this: The XFS *disk format specification*
> says that di_changecount is bumped for every change that is made to
> the inode.
> 
> Applications that are written from this specification expect the on
> disk format for a given XFS filesystem feature to remain the same
> until it is either deprecated and removed or we add feature flags to
> indicate it has different behaviour.  We can't just change the
> behaviour at a whim.
> 
> And that's ignoring the fact that randomly spewing NOIVER
> into transactions that modify inode metadata is a nasty hack - it
> is not desirable from a design or documentation POV, nor is it
> maintainable.
> 
> > > Why would that be nice to avoid?
> > > You did not specify any use case where incrementing i_version
> > > on block mapping change matters in practice.
> > > On the contrary, you said that NFS client writer sends COMMIT on close,
> > > which should stabilize i_version for the next readers.
> > > 
> > > Given that we already have an xfs implementation that does increment
> > > i_version on block mapping changes and it would be a pain to change
> > > that or add new user options, I don't see the point in discussing it further
> > > unless there is a good incentive for avoiding i_version updates in those cases.
> > > 
> > 
> > Because the change to the block allocation doesn't represent an
> > "explicit" change to the inode. We will have bumped the ctime on the
> > original write (in update_time), but the follow-on changes that occur
> > due to that write needn't be counted as they aren't visible to the
> > client.
> > 
> > It's possible for a client to issue a read between the write and the
> > flush and get the interim value for i_version. Then, once the flush
> > happens and the i_version gets bumped again, the client invalidates its
> > cache even though it needn't do so.
> > 
> > The race window ought to be relatively small, and this wouldn't result
> > in incorrect behavior that you'd notice (other than loss of
> > performance), but it's not ideal. We're doing more on-the-wire reads
> > than are necessary in this case.
> > 
> > It would be nice to have it not do that. If we end up taking this patch
> > to make it elide the i_version bumps on atime updates, we may be able to
> > set the NOIVER flag in other cases as well, and avoid some of these
> > extra bumps.
> 
> 
> <sigh>
> 
> Please don't make me repeat myself for the third time.
> 
> Once we have decided on a solid, unchanging definition for the
> *statx user API variable*, we'll implement a new on-disk field that
> provides this information.  We will document it in the on-disk
> specification as "this is how di_iversion behaves" so that it is
> clear to everyone parsing the on-disk format or writing their own
> XFS driver how to implement it and when to expect it to
> change.
> 
> Then we can add filesystem and inode feature flags that say "inode
> has new iversion" and we use that to populate the kernel iversion
> instead of di_changecount. We keep di_changecount exactly the way it
> is now for the applications and use cases we already have for that
> specific behaviour. If the kernel and/or filesystem don't support
> the new di_iversion field, then we'll use di_changecount as it
> currently exists for the kernel iversion code.
> 

Aside from NFS and IMA, what applications are dependent on the current
definition and how do they rely on i_version today?

> Keep in mind that we've been doing dynamic inode format updates in
> XFS for a couple of decades - users don't even have to be aware that
> they need to perform format upgrades because often they just happen
> whenever an inode is accessed. IOWs, just because we have to change
> the on-disk format to support this new iversion definition, it
> doesn't mean users have to reformat filesystems before the new
> feature can be used.
> 
> Hence, over time, as distros update kernels, the XFS iversion
> behaviour will change automagically as we update inodes in existing
> filesystems as they are accessed to add and then use the new
> di_iversion field for the VFS change attribute field instead of the
> di_changecount field...
> 

If you want to create a whole new on-disk field for this, then that's
your prerogative, but before you do that, I'd like to better understand
why and how the constraints on this field changed.

The original log message from the commit that added a change counter
(below) stated that you were adding it for network filesystems like NFS.
When did this change and why?

    commit dc037ad7d24f3711e431a45c053b5d425995e9e4
    Author: Dave Chinner <dchinner@redhat.com>
    Date:   Thu Jun 27 16:04:59 2013 +1000

        xfs: implement inode change count

        For CRC enabled filesystems, add support for the monotonic inode
        version change counter that is needed by protocols like NFSv4 for
        determining if the inode has changed in any way at all between two
        unrelated operations on the inode.

        This bumps the change count the first time an inode is dirtied in a
        transaction. Since all modifications to the inode are logged, this
        will catch all changes that are made to the inode, including
        timestamp updates that occur during data writes.

        Signed-off-by: Dave Chinner <dchinner@redhat.com>
        Reviewed-by: Mark Tinguely <tinguely@sgi.com>
        Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>
        Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Chinner Aug. 30, 2022, 12:08 a.m. UTC | #14
On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote:
> On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> > > 
> > > The race window ought to be relatively small, and this wouldn't result
> > > in incorrect behavior that you'd notice (other than loss of
> > > performance), but it's not ideal. We're doing more on-the-wire reads
> > > than are necessary in this case.
> > > 
> > > It would be nice to have it not do that. If we end up taking this patch
> > > to make it elide the i_version bumps on atime updates, we may be able to
> > > set the NOIVER flag in other cases as well, and avoid some of these
> > > extra bumps.
> > 
> > 
> > <sigh>
> > 
> > Please don't make me repeat myself for the third time.
> > 
> > Once we have decided on a solid, unchanging definition for the
> > *statx user API variable*, we'll implement a new on-disk field that
> > provides this information.  We will document it in the on-disk
> > specification as "this is how di_iversion behaves" so that it is
> > clear to everyone parsing the on-disk format or writing their own
> > XFS driver how to implement it and when to expect it to
> > change.
> > 
> > Then we can add filesystem and inode feature flags that say "inode
> > has new iversion" and we use that to populate the kernel iversion
> > instead of di_changecount. We keep di_changecount exactly the way it
> > is now for the applications and use cases we already have for that
> > specific behaviour. If the kernel and/or filesystem don't support
> > the new di_iversion field, then we'll use di_changecount as it
> > currently exists for the kernel iversion code.
> > 
> 
> Aside from NFS and IMA, what applications are dependent on the current
> definition and how do they rely on i_version today?

I've answered this multiple times already: the di_changecount
behaviour is defined in the on-disk specification and hence we
*cannot change the behaviour* without changing the on-disk format
specification.

Apart from the forensics aspect of the change counter (and nobody
but us XFS developers seems to understand just how damn important
this is), there are *many* third party applications that parse the
XFS on-disk format directly. This:

https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1

Shows grub2, libparted, syslinux, partclone and fsarchiver as
knowing about XFS on-disk superblock flags that tell them what
format the directory structure is in. That alone is enough to
indicate they parse on-disk inodes directly, and hence may expect
di_changecount to have specific meaning and use it to detect
unexpected changes to files/directories they care about.

If I go looking for XFS_SB_MAGIC, I find things like libblkid,
klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the
on-disk superblocks directly from the block device, too. They also
rely directly on XFS developers ensuring there are no silent
incompatible changes to the on-disk format.
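
As a trivial example of that kind of direct parsing, detecting an XFS
filesystem only takes reading the superblock magic from the start of
the device; a sketch (XFS_SB_MAGIC is "XFSB", 0x58465342, stored
big-endian at offset zero):

#include <endian.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define XFS_SB_MAGIC	0x58465342	/* "XFSB" */

int main(int argc, char **argv)
{
	uint32_t magic;
	int fd = open(argv[1], O_RDONLY);	/* e.g. /dev/sda1 */

	if (fd < 0 || pread(fd, &magic, sizeof(magic), 0) != sizeof(magic))
		return 1;
	printf("%s: %s\n", argv[1],
	       be32toh(magic) == XFS_SB_MAGIC ? "XFS" : "not XFS");
	return 0;
}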

I also know of many other utilities that people and companies have
written that parse the on disk format directly from userspace. The
functions these perform include low level storage management tools,
copying and managing disk images (e.g. offline configuration for
cluster deployments), data recovery tools that scrape all the data
out of broken filesystems, etc.

These applications are reliant on the guarantee we provide that the
on-disk format will not silently change and that behaviour/structure
can always easily be discovered by feature flags in the superblock
and/or inodes.

IOWs, just because there aren't obvious "traditional" applications on
top of the kernel filesystem that consume the in-memory kernel
iversion field, it does not mean that the defined behaviour of the
on-disk di_changecount field is not used or relied on by other tools
that work directly on the on-disk format.

You might be right that NFS doesn't care about this, but the point
remains that NFS does not control the XFS on-disk format, nor does
the fact that what NFS wants from the change attribute has changed
over time override the fact that maintaining XFS on-disk format
compatibility is the responsibility of XFS developers. We're willing
to change the on-disk format to support whatever the new definition
of the statx change attribute ends up being, and that should be the
end of the discussion.

> > Keep in mind that we've been doing dynamic inode format updates in
> > XFS for a couple of decades - users don't even have to be aware that
> > they need to perform format upgrades because often they just happen
> > whenever an inode is accessed. IOWs, just because we have to change
> > the on-disk format to support this new iversion definition, it
> > doesn't mean users have to reformat filesystems before the new
> > feature can be used.
> > 
> > Hence, over time, as distros update kernels, the XFS iversion
> > behaviour will change automagically as we update inodes in existing
> > filesystems as they are accessed to add and then use the new
> > di_iversion field for the VFS change attribute field instead of the
> > di_changecount field...
> > 
> 
> If you want to create a whole new on-disk field for this, then that's
> your prerogative, but before you do that, I'd like to better understand
> why and how the constraints on this field changed.
> 
> The original log message from the commit that added a change counter
> (below) stated that you were adding it for network filesystems like NFS.
> When did this change and why?

It never changed. I'll repeat what I've already explained twice
before:

https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/
https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/

tl;dr: NFS requirements were just one of *many* we had at the time
for an atomic persistent change counter.

The fact is that NFS users are just going to have to put up with
random cache invalidations on XFS for a while longer. Nobody noticed
this and/or cared about this enough to raise it as an issue for the
past decade, so waiting another few months for upstream XFS to
change to a different on-disk format for the NFS/statx change
attribute isn't a big deal.

-Dave.
Jeff Layton Aug. 30, 2022, 11:20 a.m. UTC | #15
On Tue, 2022-08-30 at 10:08 +1000, Dave Chinner wrote:
> On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> > > > 
> > > > The race window ought to be relatively small, and this wouldn't result
> > > > in incorrect behavior that you'd notice (other than loss of
> > > > performance), but it's not ideal. We're doing more on-the-wire reads
> > > > than are necessary in this case.
> > > > 
> > > > It would be nice to have it not do that. If we end up taking this patch
> > > > to make it elide the i_version bumps on atime updates, we may be able to
> > > > set the NOIVER flag in other cases as well, and avoid some of these
> > > > extra bumps.
> > > 
> > > 
> > > <sigh>
> > > 
> > > Please don't make me repeat myself for the third time.
> > > 
> > > Once we have decided on a solid, unchanging definition for the
> > > *statx user API variable*, we'll implement a new on-disk field that
> > > provides this information.  We will document it in the on-disk
> > > specification as "this is how di_iversion behaves" so that it is
> > > clear to everyone parsing the on-disk format or writing their own
> > > XFS driver how to implement it and when to expect it to
> > > change.
> > > 
> > > Then we can add filesystem and inode feature flags that say "inode
> > > has new iversion" and we use that to populate the kernel iversion
> > > instead of di_changecount. We keep di_changecount exactly the way it
> > > is now for the applications and use cases we already have for that
> > > specific behaviour. If the kernel and/or filesystem don't support
> > > the new di_iversion field, then we'll use di_changecount as it
> > > currently exists for the kernel iversion code.
> > > 
> > 
> > Aside from NFS and IMA, what applications are dependent on the current
> > definition and how do they rely on i_version today?
> 
> I've answered this multiple times already: the di_changecount
> behaviour is defined in the on-disk specification and hence we
> *cannot change the behaviour* without changing the on-disk format
> specification.
> 
> Apart from the forensics aspect of the change counter (and nobody
> but us XFS developers seems to understand just how damn important
> this is), there are *many* third party applications that parse the
> XFS on-disk format directly. This:
> 
> https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1
> 
> Shows grub2, libparted, syslinux, partclone and fsarchiver as
> knowing about XFS on-disk superblock flags that tell them what
> format the directory structure is in. That alone is enough to
> indicate they parse on-disk inodes directly, and hence may expect
> di_changecount to have specific meaning and use it to detect
> unexpected changes to files/directories they care about.
> 
> If I go looking for XFS_SB_MAGIC, I find things like libblkid,
> klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the
> on-disk superblocks directly from the block device, too. They also
> rely directly on XFS developers ensuring there are no silent
> incompatible changes to the on-disk format.
> 
> I also know of many other utilities that people and companies have
> written that parse the on disk format directly from userspace. The
> functions these perform include low level storage management tools,
> copying and managing disk images (e.g. offline configuration for
> cluster deployments), data recovery tools that scrape all the data
> out of broken filesystems, etc.
> 
> These applications are reliant on the guarantee we provide that the
> on-disk format will not silently change and that behaviour/structure
> can always easily be discovered by feature flags in the superblock
> and/or inodes.
> 
> IOWs, just because there aren't obvious "traditional" applications on
> top of the kernel filesystem that consume the in-memory kernel
> iversion field, it does not mean that the defined behaviour of the
> on-disk di_changecount field is not used or relied on by other tools
> that work directly on the on-disk format.
> 
> You might be right that NFS doesn't care about this, but the point
> remains that NFS does not control the XFS on-disk format, nor does
> the fact that what NFS wants from the change attribute has changed
> over time override the fact that maintaining XFS on-disk format
> compatibility is the responsibility of XFS developers. We're willing
> to change the on-disk format to support whatever the new definition
> of the statx change attribute ends up being, and that should be the
> end of the discussion.
> 

Thanks for spelling this out in more detail.

> > > Keep in mind that we've been doing dynamic inode format updates in
> > > XFS for a couple of decades - users don't even have to be aware that
> > > they need to perform format upgrades because often they just happen
> > > whenever an inode is accessed. IOWs, just because we have to change
> > > the on-disk format to support this new iversion definition, it
> > > doesn't mean users have to reformat filesystems before the new
> > > feature can be used.
> > > 
> > > Hence, over time, as distros update kernels, the XFS iversion
> > > behaviour will change automagically as we update inodes in existing
> > > filesystems as they are accessed to add and then use the new
> > > di_iversion field for the VFS change attribute field instead of the
> > > di_changecount field...
> > > 
> > 
> > If you want to create a whole new on-disk field for this, then that's
> > your prerogative, but before you do that, I'd like to better understand
> > why and how the constraints on this field changed.
> > 
> > The original log message from the commit that added a change counter
> > (below) stated that you were adding it for network filesystems like NFS.
> > When did this change and why?
> 
> It never changed. I'll repeat what I've already explained twice
> before:
> 
> https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/
> https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/
> 
> tl;dr: NFS requirements were just one of *many* we had at the time
> for an atomic persistent change counter.
> 
> The fact is that NFS users are just going to have to put up with
> random cache invalidations on XFS for a while longer. Nobody noticed
> this and/or cared about this enough to raise it as an issue for the
> past decade, so waiting another few months for upstream XFS to
> change to a different on-disk format for the NFS/statx change
> attribute isn't a big deal.
> 

Fair enough. I'll plan to drop this patch from the series for now, with
the expectation that you guys will add a new i_version counter that
better conforms to what NFS and IMA need.

Thanks,
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b351b9dc6561..866a4c5cf70c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -323,7 +323,7 @@  struct xfs_inode_log_format_32 {
 #define	XFS_ILOG_ABROOT	0x100	/* log i_af.i_broot */
 #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
 #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
-
+#define XFS_ILOG_NOIVER	0x800	/* don't bump i_version */
 
 /*
  * The timestamps are dirty, but not necessarily anything else in the inode
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..ffe6d296e7f9 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -126,7 +126,7 @@  xfs_trans_log_inode(
 	 * unconditionally.
 	 */
 	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
-		if (IS_I_VERSION(inode) &&
+		if (!(flags & XFS_ILOG_NOIVER) && IS_I_VERSION(inode) &&
 		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
 			iversion_flags = XFS_ILOG_CORE;
 	}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 45518b8c613c..94f14d96641b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1041,10 +1041,17 @@  xfs_vn_update_time(
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME)
+
+	if (!(flags & S_VERSION))
+		log_flags |= XFS_ILOG_NOIVER;
+	if (flags & S_CTIME) {
 		inode->i_ctime = *now;
-	if (flags & S_MTIME)
+		log_flags &= ~XFS_ILOG_NOIVER;
+	}
+	if (flags & S_MTIME) {
 		inode->i_mtime = *now;
+		log_flags &= ~XFS_ILOG_NOIVER;
+	}
 	if (flags & S_ATIME)
 		inode->i_atime = *now;