Message ID | 20220826214703.134870-2-jlayton@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfs: clean up i_version behavior and expose it via statx | expand |
On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote: > The i_version field in the kernel has had different semantics over > the decades, but we're now proposing to expose it to userland via > statx. This means that we need a clear, consistent definition of > what it means and when it should change. > > Update the comments in iversion.h to describe how a conformant > i_version implementation is expected to behave. This definition > suits the current users of i_version (NFSv4 and IMA), but is > loose enough to allow for a wide range of possible implementations. > > Cc: Colin Walters <walters@verbum.org> > Cc: NeilBrown <neilb@suse.de> > Cc: Trond Myklebust <trondmy@hammerspace.com> > Cc: Dave Chinner <david@fromorbit.com> > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > include/linux/iversion.h | 23 +++++++++++++++++++++-- > 1 file changed, 21 insertions(+), 2 deletions(-) > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > index 3bfebde5a1a6..45e93e1b4edc 100644 > --- a/include/linux/iversion.h > +++ b/include/linux/iversion.h > @@ -9,8 +9,19 @@ > * --------------------------- > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > - * appear different to observers if there was a change to the inode's data or > - * metadata since it was last queried. > + * appear different to observers if there was an explicit change to the inode's > + * data or metadata since it was last queried. > + * > + * An explicit change is one that would ordinarily result in a change to the > + * inode status change time (aka ctime). The version must appear to change, even > + * if the ctime does not (since the whole point is to avoid missing updates due > + * to timestamp granularity). If POSIX mandates that the ctime must change due > + * to an operation, then the i_version counter must be incremented as well. > + * > + * A conformant implementation is allowed to increment the counter in other > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine > + * whether caches are up to date. Spurious increments can cause false cache > + * invalidations. "not optimal", but never-the-less allowed - that's "unspecified behaviour" if I've ever seen it. How is userspace supposed to know/deal with this? Indeed, this loophole clause doesn't exist in the man pages that define what statx.stx_ino_version means. The man pages explicitly define that stx_ino_version only ever changes when stx_ctime changes. IOWs, the behaviour userspace developers are going to expect *does not include* stx_ino_version changing it more often than ctime is changed. Hence a kernel iversion implementation that bumps the counter more often than ctime changes *is not conformant with the statx version counter specification*. IOWs, we can't export such behaviour to userspace *ever* - it is a non-conformant implementation. Hence I think anything that bumps iversion outside the bounds of the statx definition should be declared as such: "Non-conformant iversion implementations: - MUST NOT be exported by statx() to userspace - MUST be -tolerated- by kernel internal applications that use iversion for their own purposes." Cheers, Dave.
On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote: > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote: > > The i_version field in the kernel has had different semantics over > > the decades, but we're now proposing to expose it to userland via > > statx. This means that we need a clear, consistent definition of > > what it means and when it should change. > > > > Update the comments in iversion.h to describe how a conformant > > i_version implementation is expected to behave. This definition > > suits the current users of i_version (NFSv4 and IMA), but is > > loose enough to allow for a wide range of possible implementations. > > > > Cc: Colin Walters <walters@verbum.org> > > Cc: NeilBrown <neilb@suse.de> > > Cc: Trond Myklebust <trondmy@hammerspace.com> > > Cc: Dave Chinner <david@fromorbit.com> > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > include/linux/iversion.h | 23 +++++++++++++++++++++-- > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > index 3bfebde5a1a6..45e93e1b4edc 100644 > > --- a/include/linux/iversion.h > > +++ b/include/linux/iversion.h > > @@ -9,8 +9,19 @@ > > * --------------------------- > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > - * appear different to observers if there was a change to the inode's data or > > - * metadata since it was last queried. > > + * appear different to observers if there was an explicit change to the inode's > > + * data or metadata since it was last queried. > > + * > > + * An explicit change is one that would ordinarily result in a change to the > > + * inode status change time (aka ctime). The version must appear to change, even > > + * if the ctime does not (since the whole point is to avoid missing updates due > > + * to timestamp granularity). If POSIX mandates that the ctime must change due > > + * to an operation, then the i_version counter must be incremented as well. > > + * > > + * A conformant implementation is allowed to increment the counter in other > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine > > + * whether caches are up to date. Spurious increments can cause false cache > > + * invalidations. > > "not optimal", but never-the-less allowed - that's "unspecified > behaviour" if I've ever seen it. How is userspace supposed to > know/deal with this? > > Indeed, this loophole clause doesn't exist in the man pages that > define what statx.stx_ino_version means. The man pages explicitly > define that stx_ino_version only ever changes when stx_ctime > changes. > We can fix the manpage to make this more clear. > IOWs, the behaviour userspace developers are going to expect *does > not include* stx_ino_version changing it more often than ctime is > changed. Hence a kernel iversion implementation that bumps the > counter more often than ctime changes *is not conformant with the > statx version counter specification*. IOWs, we can't export such > behaviour to userspace *ever* - it is a non-conformant > implementation. > Nonsense. The statx version counter specification is *whatever we decide to make it*. If we define it to allow for spurious version bumps, then these implementations would be conformant. Given that you can't tell what or how much changed in the inode whenever the value changes, allowing it to be bumped on non-observable changes is ok and the counter is still useful. When you see it change you need to go stat/read/getxattr etc, to see what actually happened anyway. Most applications won't be interested in every possible explicit change that can happen to an inode. It's likely these applications would check the parts of the inode they're interested in, and then go back to waiting for the next bump if the change wasn't significant to them. > Hence I think anything that bumps iversion outside the bounds of the > statx definition should be declared as such: > > "Non-conformant iversion implementations: > - MUST NOT be exported by statx() to userspace > - MUST be -tolerated- by kernel internal applications that > use iversion for their own purposes." > I think this is more strict than is needed. An implementation that bumps this value more often than is necessary is still useful. It's not _ideal_, but it still meets the needs of NFSv4, IMA and other potential users of it. After all, this is basically the definition of i_version today and it's still useful, even if atime update i_version bumps are currently harmful for performance.
On Mon, 29 Aug 2022, Jeff Layton wrote: > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote: > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote: > > > The i_version field in the kernel has had different semantics over > > > the decades, but we're now proposing to expose it to userland via > > > statx. This means that we need a clear, consistent definition of > > > what it means and when it should change. > > > > > > Update the comments in iversion.h to describe how a conformant > > > i_version implementation is expected to behave. This definition > > > suits the current users of i_version (NFSv4 and IMA), but is > > > loose enough to allow for a wide range of possible implementations. > > > > > > Cc: Colin Walters <walters@verbum.org> > > > Cc: NeilBrown <neilb@suse.de> > > > Cc: Trond Myklebust <trondmy@hammerspace.com> > > > Cc: Dave Chinner <david@fromorbit.com> > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > include/linux/iversion.h | 23 +++++++++++++++++++++-- > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > > index 3bfebde5a1a6..45e93e1b4edc 100644 > > > --- a/include/linux/iversion.h > > > +++ b/include/linux/iversion.h > > > @@ -9,8 +9,19 @@ > > > * --------------------------- > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > > - * appear different to observers if there was a change to the inode's data or > > > - * metadata since it was last queried. > > > + * appear different to observers if there was an explicit change to the inode's > > > + * data or metadata since it was last queried. > > > + * > > > + * An explicit change is one that would ordinarily result in a change to the > > > + * inode status change time (aka ctime). The version must appear to change, even > > > + * if the ctime does not (since the whole point is to avoid missing updates due > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due > > > + * to an operation, then the i_version counter must be incremented as well. > > > + * > > > + * A conformant implementation is allowed to increment the counter in other > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine > > > + * whether caches are up to date. Spurious increments can cause false cache > > > + * invalidations. > > > > "not optimal", but never-the-less allowed - that's "unspecified > > behaviour" if I've ever seen it. How is userspace supposed to > > know/deal with this? > > > > Indeed, this loophole clause doesn't exist in the man pages that > > define what statx.stx_ino_version means. The man pages explicitly > > define that stx_ino_version only ever changes when stx_ctime > > changes. > > > > We can fix the manpage to make this more clear. > > > IOWs, the behaviour userspace developers are going to expect *does > > not include* stx_ino_version changing it more often than ctime is > > changed. Hence a kernel iversion implementation that bumps the > > counter more often than ctime changes *is not conformant with the > > statx version counter specification*. IOWs, we can't export such > > behaviour to userspace *ever* - it is a non-conformant > > implementation. > > > > Nonsense. The statx version counter specification is *whatever we decide > to make it*. If we define it to allow for spurious version bumps, then > these implementations would be conformant. > > Given that you can't tell what or how much changed in the inode whenever > the value changes, allowing it to be bumped on non-observable changes is > ok and the counter is still useful. When you see it change you need to > go stat/read/getxattr etc, to see what actually happened anyway. > > Most applications won't be interested in every possible explicit change > that can happen to an inode. It's likely these applications would check > the parts of the inode they're interested in, and then go back to > waiting for the next bump if the change wasn't significant to them. > > > > Hence I think anything that bumps iversion outside the bounds of the > > statx definition should be declared as such: > > > > "Non-conformant iversion implementations: > > - MUST NOT be exported by statx() to userspace > > - MUST be -tolerated- by kernel internal applications that > > use iversion for their own purposes." > > > > I think this is more strict than is needed. An implementation that bumps > this value more often than is necessary is still useful. It's not > _ideal_, but it still meets the needs of NFSv4, IMA and other potential > users of it. After all, this is basically the definition of i_version > today and it's still useful, even if atime update i_version bumps are > currently harmful for performance. Why do you want to let it be OK? Who is hurt by it being "more strict than needed"? There is an obvious cost in not being strict as an implementation can be compliant but completely useless (increment every nanosecond). So there needs to be a clear benefit to balance this. Who benefits by not being strict? Also: Your spec doesn't say it must increase, only it must be different. So would as hash of all data and metadata be allowed (sysfs might be able to provide that, but probably wouldn't bother). Also: if stray updates are still conformant, can occasional repeated values be still conformant? I would like for a high-precision ctime timestamp to be acceptable, but as time can go backwards it is currently not conformant (even though the xfs iversion which is less useful is actually conformant). NeilBrown
On Mon, Aug 29, 2022 at 06:39:04AM -0400, Jeff Layton wrote: > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote: > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote: > > > The i_version field in the kernel has had different semantics over > > > the decades, but we're now proposing to expose it to userland via > > > statx. This means that we need a clear, consistent definition of > > > what it means and when it should change. > > > > > > Update the comments in iversion.h to describe how a conformant > > > i_version implementation is expected to behave. This definition > > > suits the current users of i_version (NFSv4 and IMA), but is > > > loose enough to allow for a wide range of possible implementations. > > > > > > Cc: Colin Walters <walters@verbum.org> > > > Cc: NeilBrown <neilb@suse.de> > > > Cc: Trond Myklebust <trondmy@hammerspace.com> > > > Cc: Dave Chinner <david@fromorbit.com> > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > include/linux/iversion.h | 23 +++++++++++++++++++++-- > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > > index 3bfebde5a1a6..45e93e1b4edc 100644 > > > --- a/include/linux/iversion.h > > > +++ b/include/linux/iversion.h > > > @@ -9,8 +9,19 @@ > > > * --------------------------- > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > > - * appear different to observers if there was a change to the inode's data or > > > - * metadata since it was last queried. > > > + * appear different to observers if there was an explicit change to the inode's > > > + * data or metadata since it was last queried. > > > + * > > > + * An explicit change is one that would ordinarily result in a change to the > > > + * inode status change time (aka ctime). The version must appear to change, even > > > + * if the ctime does not (since the whole point is to avoid missing updates due > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due > > > + * to an operation, then the i_version counter must be incremented as well. > > > + * > > > + * A conformant implementation is allowed to increment the counter in other > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine > > > + * whether caches are up to date. Spurious increments can cause false cache > > > + * invalidations. > > > > "not optimal", but never-the-less allowed - that's "unspecified > > behaviour" if I've ever seen it. How is userspace supposed to > > know/deal with this? > > > > Indeed, this loophole clause doesn't exist in the man pages that > > define what statx.stx_ino_version means. The man pages explicitly > > define that stx_ino_version only ever changes when stx_ctime > > changes. > > > > We can fix the manpage to make this more clear. > > > IOWs, the behaviour userspace developers are going to expect *does > > not include* stx_ino_version changing it more often than ctime is > > changed. Hence a kernel iversion implementation that bumps the > > counter more often than ctime changes *is not conformant with the > > statx version counter specification*. IOWs, we can't export such > > behaviour to userspace *ever* - it is a non-conformant > > implementation. > > > > Nonsense. The statx version counter specification is *whatever we decide > to make it*. Yes, but... > If we define it to allow for spurious version bumps, then > these implementations would be conformant. ... that's _not how you defined stx_ino_version to behave_! > Given that you can't tell what or how much changed in the inode whenever > the value changes, allowing it to be bumped on non-observable changes is > ok and the counter is still useful. When you see it change you need to > go stat/read/getxattr etc, to see what actually happened anyway. IDGI. If this is acceptible, then you're forcing userspace into "store and filter" implementations as the only viable method of using the change notification usefully. That means atime is just another attribute in the "store and filter" algorithm, so if this is how we define stx_ino_version behaviour, why carve out an explicit exception for atime? > Most applications won't be interested in every possible explicit change > that can happen to an inode. It's likely these applications would check > the parts of the inode they're interested in, and then go back to > waiting for the next bump if the change wasn't significant to them. Yes, that is exactly my point. You make the argument that we must not bump iversion in certain situations (atime) because it will cause spurious cache invalidations, but then say it is OK to bump it in others regardless of the fact that it will cause spurious cache invalidations. And you justify this latter behaviour by saying it is up to the application to avoid spurious invalidations by using "store and filter" algorithms. If the application has to store state and filter changes indicated by stx_ino_version changing, then by definition *it must be capable of filtering iversion bumps as a result of atime changes*. The iversion exception carved out for atime requires the application to implement "store and filter" algorithms only if it needs to care about atime changes. The "invisible bump" exception carved out here *requires* applications to implement "store and filter" algorithms to filter out invisible bumps. Hence if we combine both these behaviours, atime bumping iversion appears to userspace exactly the same as "invisible bump occurred, followed by access that changes atime". IOWs, userspace cannot tell the difference between a filesystem implementation that doesn't bump iversion on atime but has invisible bump, and a filesystem that bumps iversion on atime updates and so it always needs to filter atime changes if it doesn't care about them. Hence if stx_ino_version can have invisible bumps, it makes no difference to userspace if atime updates bump iversion or not. They will have to filter atime if they don't care about it, and they have to store the new stx_ino_version every time they filter out an invisible bump that doesn't change anything their filters care about (e.g. atime!). At which point I have to ask: if we are expecting userspace to filter out invisible iversion bumps because that's allowed, conformant behaviour, then why aren't we requiring both the NFS server and IMA applications to filter spurious iversion bumps as well? > > Hence I think anything that bumps iversion outside the bounds of the > > statx definition should be declared as such: > > > > "Non-conformant iversion implementations: > > - MUST NOT be exported by statx() to userspace > > - MUST be -tolerated- by kernel internal applications that > > use iversion for their own purposes." > > > > I think this is more strict than is needed. An implementation that bumps > this value more often than is necessary is still useful. I never said that non-conformant implementations aren't useful. What I said is they aren't conformant with the provided definition of stx_ino_version, and as a result we should not allow them to be exposed to userspace. Cheers, Dave.
On Tue, 2022-08-30 at 08:58 +1000, NeilBrown wrote: > On Mon, 29 Aug 2022, Jeff Layton wrote: > > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote: > > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote: > > > > The i_version field in the kernel has had different semantics over > > > > the decades, but we're now proposing to expose it to userland via > > > > statx. This means that we need a clear, consistent definition of > > > > what it means and when it should change. > > > > > > > > Update the comments in iversion.h to describe how a conformant > > > > i_version implementation is expected to behave. This definition > > > > suits the current users of i_version (NFSv4 and IMA), but is > > > > loose enough to allow for a wide range of possible implementations. > > > > > > > > Cc: Colin Walters <walters@verbum.org> > > > > Cc: NeilBrown <neilb@suse.de> > > > > Cc: Trond Myklebust <trondmy@hammerspace.com> > > > > Cc: Dave Chinner <david@fromorbit.com> > > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > --- > > > > include/linux/iversion.h | 23 +++++++++++++++++++++-- > > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > > > index 3bfebde5a1a6..45e93e1b4edc 100644 > > > > --- a/include/linux/iversion.h > > > > +++ b/include/linux/iversion.h > > > > @@ -9,8 +9,19 @@ > > > > * --------------------------- > > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > > > - * appear different to observers if there was a change to the inode's data or > > > > - * metadata since it was last queried. > > > > + * appear different to observers if there was an explicit change to the inode's > > > > + * data or metadata since it was last queried. > > > > + * > > > > + * An explicit change is one that would ordinarily result in a change to the > > > > + * inode status change time (aka ctime). The version must appear to change, even > > > > + * if the ctime does not (since the whole point is to avoid missing updates due > > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due > > > > + * to an operation, then the i_version counter must be incremented as well. > > > > + * > > > > + * A conformant implementation is allowed to increment the counter in other > > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine > > > > + * whether caches are up to date. Spurious increments can cause false cache > > > > + * invalidations. > > > > > > "not optimal", but never-the-less allowed - that's "unspecified > > > behaviour" if I've ever seen it. How is userspace supposed to > > > know/deal with this? > > > > > > Indeed, this loophole clause doesn't exist in the man pages that > > > define what statx.stx_ino_version means. The man pages explicitly > > > define that stx_ino_version only ever changes when stx_ctime > > > changes. > > > > > > > We can fix the manpage to make this more clear. > > > > > IOWs, the behaviour userspace developers are going to expect *does > > > not include* stx_ino_version changing it more often than ctime is > > > changed. Hence a kernel iversion implementation that bumps the > > > counter more often than ctime changes *is not conformant with the > > > statx version counter specification*. IOWs, we can't export such > > > behaviour to userspace *ever* - it is a non-conformant > > > implementation. > > > > > > > Nonsense. The statx version counter specification is *whatever we decide > > to make it*. If we define it to allow for spurious version bumps, then > > these implementations would be conformant. > > > > Given that you can't tell what or how much changed in the inode whenever > > the value changes, allowing it to be bumped on non-observable changes is > > ok and the counter is still useful. When you see it change you need to > > go stat/read/getxattr etc, to see what actually happened anyway. > > > > Most applications won't be interested in every possible explicit change > > that can happen to an inode. It's likely these applications would check > > the parts of the inode they're interested in, and then go back to > > waiting for the next bump if the change wasn't significant to them. > > > > > > > Hence I think anything that bumps iversion outside the bounds of the > > > statx definition should be declared as such: > > > > > > "Non-conformant iversion implementations: > > > - MUST NOT be exported by statx() to userspace > > > - MUST be -tolerated- by kernel internal applications that > > > use iversion for their own purposes." > > > > > > > I think this is more strict than is needed. An implementation that bumps > > this value more often than is necessary is still useful. It's not > > _ideal_, but it still meets the needs of NFSv4, IMA and other potential > > users of it. After all, this is basically the definition of i_version > > today and it's still useful, even if atime update i_version bumps are > > currently harmful for performance. > > Why do you want to let it be OK? Who is hurt by it being "more strict > than needed"? There is an obvious cost in not being strict as an > implementation can be compliant but completely useless (increment every > nanosecond). So there needs to be a clear benefit to balance this. Who > benefits by not being strict? > Other filesystems that may not be able to provide the strict semantics required. I don't have any names to name here -- I'm just trying to ensure that we don't paint ourselves into a corner with rules that are more strict than we really need. If the consensus is that we should keep the definition strict, then Ican live with that. That might narrow the number of filesystems that can provide this attribute though. > Also: Your spec doesn't say it must increase, only it must be different. > So would as hash of all data and metadata be allowed (sysfs might be > able to provide that, but probably wouldn't bother). > > Also: if stray updates are still conformant, can occasional repeated > values be still conformant? I would like for a high-precision ctime > timestamp to be acceptable, but as time can go backwards it is currently > not conformant (even though the xfs iversion which is less useful is > actually conformant). > Yes, saying only that it must be different is intentional. What we really want is for consumers to treat this as an opaque value for the most part [1]. Therefore an implementation based on hashing would conform to the spec, I'd think, as long as all of the relevant info is part of the hash. OTOH, hash collisions could be an issue here and I think we need to avoid allowing duplicate values. When it comes to watching for changes to an inode, false positives are generally ok since that should just affect performance but not functionality. False negatives are a different matter. They can lead to cache coherency issues with NFS, or missing remeasurements in IMA. Userland applications that use this could be subject to similar issues.
On Tue, 2022-08-30 at 11:04 +1000, Dave Chinner wrote: > On Mon, Aug 29, 2022 at 06:39:04AM -0400, Jeff Layton wrote: > > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote: > > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote: > > > > The i_version field in the kernel has had different semantics over > > > > the decades, but we're now proposing to expose it to userland via > > > > statx. This means that we need a clear, consistent definition of > > > > what it means and when it should change. > > > > > > > > Update the comments in iversion.h to describe how a conformant > > > > i_version implementation is expected to behave. This definition > > > > suits the current users of i_version (NFSv4 and IMA), but is > > > > loose enough to allow for a wide range of possible implementations. > > > > > > > > Cc: Colin Walters <walters@verbum.org> > > > > Cc: NeilBrown <neilb@suse.de> > > > > Cc: Trond Myklebust <trondmy@hammerspace.com> > > > > Cc: Dave Chinner <david@fromorbit.com> > > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > --- > > > > include/linux/iversion.h | 23 +++++++++++++++++++++-- > > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > > > index 3bfebde5a1a6..45e93e1b4edc 100644 > > > > --- a/include/linux/iversion.h > > > > +++ b/include/linux/iversion.h > > > > @@ -9,8 +9,19 @@ > > > > * --------------------------- > > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > > > - * appear different to observers if there was a change to the inode's data or > > > > - * metadata since it was last queried. > > > > + * appear different to observers if there was an explicit change to the inode's > > > > + * data or metadata since it was last queried. > > > > + * > > > > + * An explicit change is one that would ordinarily result in a change to the > > > > + * inode status change time (aka ctime). The version must appear to change, even > > > > + * if the ctime does not (since the whole point is to avoid missing updates due > > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due > > > > + * to an operation, then the i_version counter must be incremented as well. > > > > + * > > > > + * A conformant implementation is allowed to increment the counter in other > > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine > > > > + * whether caches are up to date. Spurious increments can cause false cache > > > > + * invalidations. > > > > > > "not optimal", but never-the-less allowed - that's "unspecified > > > behaviour" if I've ever seen it. How is userspace supposed to > > > know/deal with this? > > > > > > Indeed, this loophole clause doesn't exist in the man pages that > > > define what statx.stx_ino_version means. The man pages explicitly > > > define that stx_ino_version only ever changes when stx_ctime > > > changes. > > > > > > > We can fix the manpage to make this more clear. > > > > > IOWs, the behaviour userspace developers are going to expect *does > > > not include* stx_ino_version changing it more often than ctime is > > > changed. Hence a kernel iversion implementation that bumps the > > > counter more often than ctime changes *is not conformant with the > > > statx version counter specification*. IOWs, we can't export such > > > behaviour to userspace *ever* - it is a non-conformant > > > implementation. > > > > > > > Nonsense. The statx version counter specification is *whatever we decide > > to make it*. > > Yes, but... > > > If we define it to allow for spurious version bumps, then > > these implementations would be conformant. > > ... that's _not how you defined stx_ino_version to behave_! > I certainly didn't say that it must _only_ be incremented when the ctime would change, only that if the ctime would change that it must be incremented. The weasel-words make all the difference. But, point taken, the spec should be explicit about this. I'll plan to revise the manpage patch and resend it. > > Given that you can't tell what or how much changed in the inode whenever > > the value changes, allowing it to be bumped on non-observable changes is > > ok and the counter is still useful. When you see it change you need to > > go stat/read/getxattr etc, to see what actually happened anyway. > > IDGI. If this is acceptible, then you're forcing userspace into > "store and filter" implementations as the only viable method of > using the change notification usefully. > Well, that's all it's really useful for anyway. The counter itself has tells you nothing other than that something changed. > That means atime is just another attribute in the "store and > filter" algorithm, so if this is how we define stx_ino_version > behaviour, why carve out an explicit exception for atime? > Because atime updates are particularly problematic. Ideally, we'd filter out all implicit updates, but that may not be feasible in all cases. > > Most applications won't be interested in every possible explicit change > > that can happen to an inode. It's likely these applications would check > > the parts of the inode they're interested in, and then go back to > > waiting for the next bump if the change wasn't significant to them. > > Yes, that is exactly my point. > > You make the argument that we must not bump iversion in certain > situations (atime) because it will cause spurious cache > invalidations, but then say it is OK to bump it in others regardless > of the fact that it will cause spurious cache invalidations. And you > justify this latter behaviour by saying it is up to the application > to avoid spurious invalidations by using "store and filter" > algorithms. > > If the application has to store state and filter changes indicated > by stx_ino_version changing, then by definition *it must be capable > of filtering iversion bumps as a result of atime changes*. > > The iversion exception carved out for atime requires the application > to implement "store and filter" algorithms only if it needs to care > about atime changes. The "invisible bump" exception carved out here > *requires* applications to implement "store and filter" algorithms > to filter out invisible bumps. > > Hence if we combine both these behaviours, atime bumping iversion > appears to userspace exactly the same as "invisible bump occurred, > followed by access that changes atime". IOWs, userspace cannot tell the > difference between a filesystem implementation that doesn't bump > iversion on atime but has invisible bump, and a filesystem that > bumps iversion on atime updates and so it always needs to filter > atime changes if it doesn't care about them. > > Hence if stx_ino_version can have invisible bumps, it makes no > difference to userspace if atime updates bump iversion or not. They > will have to filter atime if they don't care about it, and they have > to store the new stx_ino_version every time they filter out an > invisible bump that doesn't change anything their filters care > about (e.g. atime!). > > At which point I have to ask: if we are expecting userspace to > filter out invisible iversion bumps because that's allowed, > conformant behaviour, then why aren't we requiring both the NFS > server and IMA applications to filter spurious iversion bumps as > well? > I think you're reading too much into my attempt to carve out atime from i_version updates. That is purely a pragmatic attempt to staunch the worst of the bleeding from this problem. atime updates are _very_ frequent, and the default relatime behavior ensures that each NFS client reading file data or dir info after a change to it will end up downloading it at least twice: once to read the file itself, and then again after it drops its cache due to the atime update from the prior read. In an ideal world, an implementation would only bump the i_version on explicit changes. If an implicit change only happens infrequently, or always happens in very close succession after an explicit change (such that readers can't race in as easily) then that's less harmful for performance. The i_version value itself can't tell you anything about the inode. It's only useful for comparing to an earlier sample to see if it is has changed. This attribute is mostly useful in the context of a store-and- invalidate kind of system. We want to keep the invalidations to a minimum, but eliminating spurious updates is more of an optimization problem than one of correctness. It would be best if we could eliminate all spurious updates, but I think the stx_ino_version is just as useful without being that strict, and that leaves the door open for other implementations that aren't able to filter out all spurious updates. > > > Hence I think anything that bumps iversion outside the bounds of the > > > statx definition should be declared as such: > > > > > > "Non-conformant iversion implementations: > > > - MUST NOT be exported by statx() to userspace > > > - MUST be -tolerated- by kernel internal applications that > > > use iversion for their own purposes." > > > > > > > I think this is more strict than is needed. An implementation that bumps > > this value more often than is necessary is still useful. > > I never said that non-conformant implementations aren't useful. What > I said is they aren't conformant with the provided definition of > stx_ino_version, and as a result we should not allow them to be > exposed to userspace. As I said above, I'll respin the manpage patch to better specify what a conformant implementation can and can't do, and we can discuss from there.
On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > Yes, saying only that it must be different is intentional. What we > really want is for consumers to treat this as an opaque value for the > most part [1]. Therefore an implementation based on hashing would > conform to the spec, I'd think, as long as all of the relevant info is > part of the hash. It'd conform, but it might not be as useful as an increasing value. E.g. a client can use that to work out which of a series of reordered write replies is the most recent, and I seem to recall that can prevent unnecessary invalidations in some cases. --b.
On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > > Yes, saying only that it must be different is intentional. What we > > really want is for consumers to treat this as an opaque value for the > > most part [1]. Therefore an implementation based on hashing would > > conform to the spec, I'd think, as long as all of the relevant info is > > part of the hash. > > It'd conform, but it might not be as useful as an increasing value. > > E.g. a client can use that to work out which of a series of reordered > write replies is the most recent, and I seem to recall that can prevent > unnecessary invalidations in some cases. > That's a good point; the linux client does this. That said, NFSv4 has a way for the server to advertise its change attribute behavior [1] (though nfsd hasn't implemented this yet). We don't have a good way to do that in userland for now. This is another place where fsinfo() would have been nice to have. I think until we have something like that, we'd want to keep our promises to userland to a minimum. [1]: https://www.rfc-editor.org/rfc/rfc7862.html#section-12.2.3 . I guess I should look at plumbing this in for IS_I_VERSION inodes...
On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > > > Yes, saying only that it must be different is intentional. What we > > > really want is for consumers to treat this as an opaque value for the > > > most part [1]. Therefore an implementation based on hashing would > > > conform to the spec, I'd think, as long as all of the relevant info is > > > part of the hash. > > > > It'd conform, but it might not be as useful as an increasing value. > > > > E.g. a client can use that to work out which of a series of reordered > > write replies is the most recent, and I seem to recall that can prevent > > unnecessary invalidations in some cases. > > > > That's a good point; the linux client does this. That said, NFSv4 has a > way for the server to advertise its change attribute behavior [1] > (though nfsd hasn't implemented this yet). It was implemented and reverted. The issue was that I thought nfsd should mix in the ctime to prevent the change attribute going backwards on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond was concerned about the possibility of time going backwards. See 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"". There's some mailing list discussion to that I'm not turning up right now. Did NFSv4 add change_attr_type because some implementations needed the unordered case, or because they realized ordering was useful but wanted to keep backwards compatibility? I don't know which it was. --b. > We don't have a good way to > do that in userland for now. > > This is another place where fsinfo() would have been nice to have. I > think until we have something like that, we'd want to keep our promises > to userland to a minimum. > > [1]: https://www.rfc-editor.org/rfc/rfc7862.html#section-12.2.3 . I > guess I should look at plumbing this in for IS_I_VERSION inodes... > > -- > Jeff Layton <jlayton@kernel.org> > >
On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > > > > Yes, saying only that it must be different is intentional. What > > > > we > > > > really want is for consumers to treat this as an opaque value > > > > for the > > > > most part [1]. Therefore an implementation based on hashing > > > > would > > > > conform to the spec, I'd think, as long as all of the relevant > > > > info is > > > > part of the hash. > > > > > > It'd conform, but it might not be as useful as an increasing > > > value. > > > > > > E.g. a client can use that to work out which of a series of > > > reordered > > > write replies is the most recent, and I seem to recall that can > > > prevent > > > unnecessary invalidations in some cases. > > > > > > > That's a good point; the linux client does this. That said, NFSv4 > > has a > > way for the server to advertise its change attribute behavior [1] > > (though nfsd hasn't implemented this yet). > > It was implemented and reverted. The issue was that I thought nfsd > should mix in the ctime to prevent the change attribute going > backwards > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond > was > concerned about the possibility of time going backwards. See > 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"". > There's some mailing list discussion to that I'm not turning up right > now. My main concern was that some filesystems (e.g. ext3) were failing to provide sufficient timestamp resolution to actually label the resulting 'change attribute' as being updated monotonically. If the time stamp doesn't change when the file data or metadata are changed, then the client has to perform extra checks to try to figure out whether or not its caches are up to date. > > Did NFSv4 add change_attr_type because some implementations needed > the > unordered case, or because they realized ordering was useful but > wanted > to keep backwards compatibility? I don't know which it was. We implemented it because, as implied above, knowledge of whether or not the change attribute behaves monotonically, or strictly monotonically, enables a number of optimisations.
On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote: > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > > > > > Yes, saying only that it must be different is intentional. What > > > > > we > > > > > really want is for consumers to treat this as an opaque value > > > > > for the > > > > > most part [1]. Therefore an implementation based on hashing > > > > > would > > > > > conform to the spec, I'd think, as long as all of the relevant > > > > > info is > > > > > part of the hash. > > > > > > > > It'd conform, but it might not be as useful as an increasing > > > > value. > > > > > > > > E.g. a client can use that to work out which of a series of > > > > reordered > > > > write replies is the most recent, and I seem to recall that can > > > > prevent > > > > unnecessary invalidations in some cases. > > > > > > > > > > That's a good point; the linux client does this. That said, NFSv4 > > > has a > > > way for the server to advertise its change attribute behavior [1] > > > (though nfsd hasn't implemented this yet). > > > > It was implemented and reverted. The issue was that I thought nfsd > > should mix in the ctime to prevent the change attribute going > > backwards > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond > > was > > concerned about the possibility of time going backwards. See > > 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"". > > There's some mailing list discussion to that I'm not turning up right > > now. https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ is what I was thinking of but it isn't actually that interesting. > My main concern was that some filesystems (e.g. ext3) were failing to > provide sufficient timestamp resolution to actually label the resulting > 'change attribute' as being updated monotonically. If the time stamp > doesn't change when the file data or metadata are changed, then the > client has to perform extra checks to try to figure out whether or not > its caches are up to date. That's a different issue from the one you were raising in that discussion. > > Did NFSv4 add change_attr_type because some implementations needed > > the > > unordered case, or because they realized ordering was useful but > > wanted > > to keep backwards compatibility? I don't know which it was. > > We implemented it because, as implied above, knowledge of whether or > not the change attribute behaves monotonically, or strictly > monotonically, enables a number of optimisations. Of course, but my question was about the value of the old behavior, not about the value of the monotonic behavior. Put differently, if we could redesign the protocol from scratch would we actually have included the option of non-monotonic behavior? --b.
On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote: > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote: > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > > > > > > Yes, saying only that it must be different is intentional. > > > > > > What > > > > > > we > > > > > > really want is for consumers to treat this as an opaque > > > > > > value > > > > > > for the > > > > > > most part [1]. Therefore an implementation based on hashing > > > > > > would > > > > > > conform to the spec, I'd think, as long as all of the > > > > > > relevant > > > > > > info is > > > > > > part of the hash. > > > > > > > > > > It'd conform, but it might not be as useful as an increasing > > > > > value. > > > > > > > > > > E.g. a client can use that to work out which of a series of > > > > > reordered > > > > > write replies is the most recent, and I seem to recall that > > > > > can > > > > > prevent > > > > > unnecessary invalidations in some cases. > > > > > > > > > > > > > That's a good point; the linux client does this. That said, > > > > NFSv4 > > > > has a > > > > way for the server to advertise its change attribute behavior > > > > [1] > > > > (though nfsd hasn't implemented this yet). > > > > > > It was implemented and reverted. The issue was that I thought > > > nfsd > > > should mix in the ctime to prevent the change attribute going > > > backwards > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but > > > Trond > > > was > > > concerned about the possibility of time going backwards. See > > > 1631087ba872 "Revert "nfsd4: support change_attr_type > > > attribute"". > > > There's some mailing list discussion to that I'm not turning up > > > right > > > now. > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ > is what I was thinking of but it isn't actually that interesting. > > > My main concern was that some filesystems (e.g. ext3) were failing > > to > > provide sufficient timestamp resolution to actually label the > > resulting > > 'change attribute' as being updated monotonically. If the time > > stamp > > doesn't change when the file data or metadata are changed, then the > > client has to perform extra checks to try to figure out whether or > > not > > its caches are up to date. > > That's a different issue from the one you were raising in that > discussion. > > > > Did NFSv4 add change_attr_type because some implementations > > > needed > > > the > > > unordered case, or because they realized ordering was useful but > > > wanted > > > to keep backwards compatibility? I don't know which it was. > > > > We implemented it because, as implied above, knowledge of whether > > or > > not the change attribute behaves monotonically, or strictly > > monotonically, enables a number of optimisations. > > Of course, but my question was about the value of the old behavior, > not > about the value of the monotonic behavior. > > Put differently, if we could redesign the protocol from scratch would > we > actually have included the option of non-monotonic behavior? > If we could design the filesystems from scratch, we probably would not. The protocol ended up being as it is because people were trying to make it as easy to implement as possible. So if we could design the filesystem from scratch, we would have probably designed it along the lines of what AFS does. i.e. each explicit change is accompanied by a single bump of the change attribute, so that the clients can not only decide the order of the resulting changes, but also if they have missed a change (that might have been made by a different client). However that would be a requirement that is likely to be very specific to distributed caches (and hence distributed filesystems). I doubt there are many user space applications that would need that high precision. Maybe MPI, but that's the only candidate I can think of for now?
On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote: > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote: > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote: > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote: > > > > > > > Yes, saying only that it must be different is intentional. > > > > > > > What > > > > > > > we > > > > > > > really want is for consumers to treat this as an opaque > > > > > > > value > > > > > > > for the > > > > > > > most part [1]. Therefore an implementation based on hashing > > > > > > > would > > > > > > > conform to the spec, I'd think, as long as all of the > > > > > > > relevant > > > > > > > info is > > > > > > > part of the hash. > > > > > > > > > > > > It'd conform, but it might not be as useful as an increasing > > > > > > value. > > > > > > > > > > > > E.g. a client can use that to work out which of a series of > > > > > > reordered > > > > > > write replies is the most recent, and I seem to recall that > > > > > > can > > > > > > prevent > > > > > > unnecessary invalidations in some cases. > > > > > > > > > > > > > > > > That's a good point; the linux client does this. That said, > > > > > NFSv4 > > > > > has a > > > > > way for the server to advertise its change attribute behavior > > > > > [1] > > > > > (though nfsd hasn't implemented this yet). > > > > > > > > It was implemented and reverted. The issue was that I thought > > > > nfsd > > > > should mix in the ctime to prevent the change attribute going > > > > backwards > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but > > > > Trond > > > > was > > > > concerned about the possibility of time going backwards. See > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type > > > > attribute"". > > > > There's some mailing list discussion to that I'm not turning up > > > > right > > > > now. > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ > > is what I was thinking of but it isn't actually that interesting. > > > > > My main concern was that some filesystems (e.g. ext3) were failing > > > to > > > provide sufficient timestamp resolution to actually label the > > > resulting > > > 'change attribute' as being updated monotonically. If the time > > > stamp > > > doesn't change when the file data or metadata are changed, then the > > > client has to perform extra checks to try to figure out whether or > > > not > > > its caches are up to date. > > > > That's a different issue from the one you were raising in that > > discussion. > > > > > > Did NFSv4 add change_attr_type because some implementations > > > > needed > > > > the > > > > unordered case, or because they realized ordering was useful but > > > > wanted > > > > to keep backwards compatibility? I don't know which it was. > > > > > > We implemented it because, as implied above, knowledge of whether > > > or > > > not the change attribute behaves monotonically, or strictly > > > monotonically, enables a number of optimisations. > > > > Of course, but my question was about the value of the old behavior, > > not > > about the value of the monotonic behavior. > > > > Put differently, if we could redesign the protocol from scratch would > > we > > actually have included the option of non-monotonic behavior? > > > > If we could design the filesystems from scratch, we probably would not. > The protocol ended up being as it is because people were trying to make > it as easy to implement as possible. > > So if we could design the filesystem from scratch, we would have > probably designed it along the lines of what AFS does. > i.e. each explicit change is accompanied by a single bump of the change > attribute, so that the clients can not only decide the order of the > resulting changes, but also if they have missed a change (that might > have been made by a different client). > > However that would be a requirement that is likely to be very specific > to distributed caches (and hence distributed filesystems). I doubt > there are many user space applications that would need that high > precision. Maybe MPI, but that's the only candidate I can think of for > now? > The fact that NFS kept this more loosely-defined is what allowed us to elide some of the i_version bumps and regain a fair bit of performance for local filesystems [1]. If the change attribute had been more strictly defined like you mention, then that particular optimization would not have been possible. This sort of thing is why I'm a fan of not defining this any more strictly than we require. Later on, maybe we'll come up with a way for filesystems to advertise that they can offer stronger guarantees.
On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote: > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote: > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote: > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote: > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton > > > > > > > wrote: > > > > > > > > Yes, saying only that it must be different is > > > > > > > > intentional. > > > > > > > > What > > > > > > > > we > > > > > > > > really want is for consumers to treat this as an opaque > > > > > > > > value > > > > > > > > for the > > > > > > > > most part [1]. Therefore an implementation based on > > > > > > > > hashing > > > > > > > > would > > > > > > > > conform to the spec, I'd think, as long as all of the > > > > > > > > relevant > > > > > > > > info is > > > > > > > > part of the hash. > > > > > > > > > > > > > > It'd conform, but it might not be as useful as an > > > > > > > increasing > > > > > > > value. > > > > > > > > > > > > > > E.g. a client can use that to work out which of a series > > > > > > > of > > > > > > > reordered > > > > > > > write replies is the most recent, and I seem to recall > > > > > > > that > > > > > > > can > > > > > > > prevent > > > > > > > unnecessary invalidations in some cases. > > > > > > > > > > > > > > > > > > > That's a good point; the linux client does this. That said, > > > > > > NFSv4 > > > > > > has a > > > > > > way for the server to advertise its change attribute > > > > > > behavior > > > > > > [1] > > > > > > (though nfsd hasn't implemented this yet). > > > > > > > > > > It was implemented and reverted. The issue was that I > > > > > thought > > > > > nfsd > > > > > should mix in the ctime to prevent the change attribute going > > > > > backwards > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but > > > > > Trond > > > > > was > > > > > concerned about the possibility of time going backwards. See > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type > > > > > attribute"". > > > > > There's some mailing list discussion to that I'm not turning > > > > > up > > > > > right > > > > > now. > > > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ > > > is what I was thinking of but it isn't actually that interesting. > > > > > > > My main concern was that some filesystems (e.g. ext3) were > > > > failing > > > > to > > > > provide sufficient timestamp resolution to actually label the > > > > resulting > > > > 'change attribute' as being updated monotonically. If the time > > > > stamp > > > > doesn't change when the file data or metadata are changed, then > > > > the > > > > client has to perform extra checks to try to figure out whether > > > > or > > > > not > > > > its caches are up to date. > > > > > > That's a different issue from the one you were raising in that > > > discussion. > > > > > > > > Did NFSv4 add change_attr_type because some implementations > > > > > needed > > > > > the > > > > > unordered case, or because they realized ordering was useful > > > > > but > > > > > wanted > > > > > to keep backwards compatibility? I don't know which it was. > > > > > > > > We implemented it because, as implied above, knowledge of > > > > whether > > > > or > > > > not the change attribute behaves monotonically, or strictly > > > > monotonically, enables a number of optimisations. > > > > > > Of course, but my question was about the value of the old > > > behavior, > > > not > > > about the value of the monotonic behavior. > > > > > > Put differently, if we could redesign the protocol from scratch > > > would > > > we > > > actually have included the option of non-monotonic behavior? > > > > > > > If we could design the filesystems from scratch, we probably would > > not. > > The protocol ended up being as it is because people were trying to > > make > > it as easy to implement as possible. > > > > So if we could design the filesystem from scratch, we would have > > probably designed it along the lines of what AFS does. > > i.e. each explicit change is accompanied by a single bump of the > > change > > attribute, so that the clients can not only decide the order of the > > resulting changes, but also if they have missed a change (that > > might > > have been made by a different client). > > > > However that would be a requirement that is likely to be very > > specific > > to distributed caches (and hence distributed filesystems). I doubt > > there are many user space applications that would need that high > > precision. Maybe MPI, but that's the only candidate I can think of > > for > > now? > > > > The fact that NFS kept this more loosely-defined is what allowed us > to > elide some of the i_version bumps and regain a fair bit of > performance > for local filesystems [1]. If the change attribute had been more > strictly defined like you mention, then that particular optimization > would not have been possible. > > This sort of thing is why I'm a fan of not defining this any more > strictly than we require. Later on, maybe we'll come up with a way > for > filesystems to advertise that they can offer stronger guarantees. What 'eliding of the bumps' are we talking about here? If it results in unreliable behaviour, then I propose we just drop the whole concept and go back to using the ctime. The change attribute is only useful if it results in a reliable mechanism for detecting changes. Once you "elide away" the word "reliable", then it has no value beyond what ctime already does.
On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote: > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote: > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote: > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote: > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote: > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote: > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote: > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton > > > > > > > > wrote: > > > > > > > > > Yes, saying only that it must be different is > > > > > > > > > intentional. > > > > > > > > > What > > > > > > > > > we > > > > > > > > > really want is for consumers to treat this as an opaque > > > > > > > > > value > > > > > > > > > for the > > > > > > > > > most part [1]. Therefore an implementation based on > > > > > > > > > hashing > > > > > > > > > would > > > > > > > > > conform to the spec, I'd think, as long as all of the > > > > > > > > > relevant > > > > > > > > > info is > > > > > > > > > part of the hash. > > > > > > > > > > > > > > > > It'd conform, but it might not be as useful as an > > > > > > > > increasing > > > > > > > > value. > > > > > > > > > > > > > > > > E.g. a client can use that to work out which of a series > > > > > > > > of > > > > > > > > reordered > > > > > > > > write replies is the most recent, and I seem to recall > > > > > > > > that > > > > > > > > can > > > > > > > > prevent > > > > > > > > unnecessary invalidations in some cases. > > > > > > > > > > > > > > > > > > > > > > That's a good point; the linux client does this. That said, > > > > > > > NFSv4 > > > > > > > has a > > > > > > > way for the server to advertise its change attribute > > > > > > > behavior > > > > > > > [1] > > > > > > > (though nfsd hasn't implemented this yet). > > > > > > > > > > > > It was implemented and reverted. The issue was that I > > > > > > thought > > > > > > nfsd > > > > > > should mix in the ctime to prevent the change attribute going > > > > > > backwards > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but > > > > > > Trond > > > > > > was > > > > > > concerned about the possibility of time going backwards. See > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type > > > > > > attribute"". > > > > > > There's some mailing list discussion to that I'm not turning > > > > > > up > > > > > > right > > > > > > now. > > > > > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ > > > > is what I was thinking of but it isn't actually that interesting. > > > > > > > > > My main concern was that some filesystems (e.g. ext3) were > > > > > failing > > > > > to > > > > > provide sufficient timestamp resolution to actually label the > > > > > resulting > > > > > 'change attribute' as being updated monotonically. If the time > > > > > stamp > > > > > doesn't change when the file data or metadata are changed, then > > > > > the > > > > > client has to perform extra checks to try to figure out whether > > > > > or > > > > > not > > > > > its caches are up to date. > > > > > > > > That's a different issue from the one you were raising in that > > > > discussion. > > > > > > > > > > Did NFSv4 add change_attr_type because some implementations > > > > > > needed > > > > > > the > > > > > > unordered case, or because they realized ordering was useful > > > > > > but > > > > > > wanted > > > > > > to keep backwards compatibility? I don't know which it was. > > > > > > > > > > We implemented it because, as implied above, knowledge of > > > > > whether > > > > > or > > > > > not the change attribute behaves monotonically, or strictly > > > > > monotonically, enables a number of optimisations. > > > > > > > > Of course, but my question was about the value of the old > > > > behavior, > > > > not > > > > about the value of the monotonic behavior. > > > > > > > > Put differently, if we could redesign the protocol from scratch > > > > would > > > > we > > > > actually have included the option of non-monotonic behavior? > > > > > > > > > > If we could design the filesystems from scratch, we probably would > > > not. > > > The protocol ended up being as it is because people were trying to > > > make > > > it as easy to implement as possible. > > > > > > So if we could design the filesystem from scratch, we would have > > > probably designed it along the lines of what AFS does. > > > i.e. each explicit change is accompanied by a single bump of the > > > change > > > attribute, so that the clients can not only decide the order of the > > > resulting changes, but also if they have missed a change (that > > > might > > > have been made by a different client). > > > > > > However that would be a requirement that is likely to be very > > > specific > > > to distributed caches (and hence distributed filesystems). I doubt > > > there are many user space applications that would need that high > > > precision. Maybe MPI, but that's the only candidate I can think of > > > for > > > now? > > > > > > > The fact that NFS kept this more loosely-defined is what allowed us > > to > > elide some of the i_version bumps and regain a fair bit of > > performance > > for local filesystems [1]. If the change attribute had been more > > strictly defined like you mention, then that particular optimization > > would not have been possible. > > > > This sort of thing is why I'm a fan of not defining this any more > > strictly than we require. Later on, maybe we'll come up with a way > > for > > filesystems to advertise that they can offer stronger guarantees. > > What 'eliding of the bumps' are we talking about here? If it results in > unreliable behaviour, then I propose we just drop the whole concept and > go back to using the ctime. The change attribute is only useful if it > results in a reliable mechanism for detecting changes. Once you "elide > away" the word "reliable", then it has no value beyond what ctime > already does. > I'm talking about the scheme to optimize away i_version updates when the current one has never been queried: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d There's nothing unreliable about it.
On Tue, 2022-08-30 at 13:53 -0400, Jeff Layton wrote: > On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote: > > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote: > > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote: > > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote: > > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust > > > > > wrote: > > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton > > > > > > > wrote: > > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields > > > > > > > > wrote: > > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton > > > > > > > > > wrote: > > > > > > > > > > Yes, saying only that it must be different is > > > > > > > > > > intentional. > > > > > > > > > > What > > > > > > > > > > we > > > > > > > > > > really want is for consumers to treat this as an > > > > > > > > > > opaque > > > > > > > > > > value > > > > > > > > > > for the > > > > > > > > > > most part [1]. Therefore an implementation based on > > > > > > > > > > hashing > > > > > > > > > > would > > > > > > > > > > conform to the spec, I'd think, as long as all of > > > > > > > > > > the > > > > > > > > > > relevant > > > > > > > > > > info is > > > > > > > > > > part of the hash. > > > > > > > > > > > > > > > > > > It'd conform, but it might not be as useful as an > > > > > > > > > increasing > > > > > > > > > value. > > > > > > > > > > > > > > > > > > E.g. a client can use that to work out which of a > > > > > > > > > series > > > > > > > > > of > > > > > > > > > reordered > > > > > > > > > write replies is the most recent, and I seem to > > > > > > > > > recall > > > > > > > > > that > > > > > > > > > can > > > > > > > > > prevent > > > > > > > > > unnecessary invalidations in some cases. > > > > > > > > > > > > > > > > > > > > > > > > > That's a good point; the linux client does this. That > > > > > > > > said, > > > > > > > > NFSv4 > > > > > > > > has a > > > > > > > > way for the server to advertise its change attribute > > > > > > > > behavior > > > > > > > > [1] > > > > > > > > (though nfsd hasn't implemented this yet). > > > > > > > > > > > > > > It was implemented and reverted. The issue was that I > > > > > > > thought > > > > > > > nfsd > > > > > > > should mix in the ctime to prevent the change attribute > > > > > > > going > > > > > > > backwards > > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), > > > > > > > but > > > > > > > Trond > > > > > > > was > > > > > > > concerned about the possibility of time going backwards. > > > > > > > See > > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type > > > > > > > attribute"". > > > > > > > There's some mailing list discussion to that I'm not > > > > > > > turning > > > > > > > up > > > > > > > right > > > > > > > now. > > > > > > > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ > > > > > is what I was thinking of but it isn't actually that > > > > > interesting. > > > > > > > > > > > My main concern was that some filesystems (e.g. ext3) were > > > > > > failing > > > > > > to > > > > > > provide sufficient timestamp resolution to actually label > > > > > > the > > > > > > resulting > > > > > > 'change attribute' as being updated monotonically. If the > > > > > > time > > > > > > stamp > > > > > > doesn't change when the file data or metadata are changed, > > > > > > then > > > > > > the > > > > > > client has to perform extra checks to try to figure out > > > > > > whether > > > > > > or > > > > > > not > > > > > > its caches are up to date. > > > > > > > > > > That's a different issue from the one you were raising in > > > > > that > > > > > discussion. > > > > > > > > > > > > Did NFSv4 add change_attr_type because some > > > > > > > implementations > > > > > > > needed > > > > > > > the > > > > > > > unordered case, or because they realized ordering was > > > > > > > useful > > > > > > > but > > > > > > > wanted > > > > > > > to keep backwards compatibility? I don't know which it > > > > > > > was. > > > > > > > > > > > > We implemented it because, as implied above, knowledge of > > > > > > whether > > > > > > or > > > > > > not the change attribute behaves monotonically, or strictly > > > > > > monotonically, enables a number of optimisations. > > > > > > > > > > Of course, but my question was about the value of the old > > > > > behavior, > > > > > not > > > > > about the value of the monotonic behavior. > > > > > > > > > > Put differently, if we could redesign the protocol from > > > > > scratch > > > > > would > > > > > we > > > > > actually have included the option of non-monotonic behavior? > > > > > > > > > > > > > If we could design the filesystems from scratch, we probably > > > > would > > > > not. > > > > The protocol ended up being as it is because people were trying > > > > to > > > > make > > > > it as easy to implement as possible. > > > > > > > > So if we could design the filesystem from scratch, we would > > > > have > > > > probably designed it along the lines of what AFS does. > > > > i.e. each explicit change is accompanied by a single bump of > > > > the > > > > change > > > > attribute, so that the clients can not only decide the order of > > > > the > > > > resulting changes, but also if they have missed a change (that > > > > might > > > > have been made by a different client). > > > > > > > > However that would be a requirement that is likely to be very > > > > specific > > > > to distributed caches (and hence distributed filesystems). I > > > > doubt > > > > there are many user space applications that would need that > > > > high > > > > precision. Maybe MPI, but that's the only candidate I can think > > > > of > > > > for > > > > now? > > > > > > > > > > The fact that NFS kept this more loosely-defined is what allowed > > > us > > > to > > > elide some of the i_version bumps and regain a fair bit of > > > performance > > > for local filesystems [1]. If the change attribute had been more > > > strictly defined like you mention, then that particular > > > optimization > > > would not have been possible. > > > > > > This sort of thing is why I'm a fan of not defining this any more > > > strictly than we require. Later on, maybe we'll come up with a > > > way > > > for > > > filesystems to advertise that they can offer stronger guarantees. > > > > What 'eliding of the bumps' are we talking about here? If it > > results in > > unreliable behaviour, then I propose we just drop the whole concept > > and > > go back to using the ctime. The change attribute is only useful if > > it > > results in a reliable mechanism for detecting changes. Once you > > "elide > > away" the word "reliable", then it has no value beyond what ctime > > already does. > > > > I'm talking about the scheme to optimize away i_version updates when > the > current one has never been queried: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d > > There's nothing unreliable about it. Not really seeing why that would be incompatible with the idea of bumping on every change. The I_VERSION_QUERIED is just a hint to tell you that at the very least you need to sync the next metadata update after someone peeked at the value. You could still continue to cache updates after that, and only sync them once a O_SYNC or an fsync() call explicitly requires you to do so.
On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote: > The fact that NFS kept this more loosely-defined is what allowed us to > elide some of the i_version bumps and regain a fair bit of performance > for local filesystems [1]. If the change attribute had been more > strictly defined like you mention, then that particular optimization > would not have been possible. > > This sort of thing is why I'm a fan of not defining this any more > strictly than we require. Later on, maybe we'll come up with a way for > filesystems to advertise that they can offer stronger guarantees. Yeah, the afs change-attribute-as-counter thing seems ambitious--I wouldn't even know how to define what exactly you're counting. My one question is whether it'd be worth just defining the thing as *increasing*. That's a lower bar. (Though admittedly we don't quite manage it now--see again 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".) --b.
On Tue, 2022-08-30 at 18:25 +0000, Trond Myklebust wrote: > On Tue, 2022-08-30 at 13:53 -0400, Jeff Layton wrote: > > On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote: > > > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote: > > > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote: > > > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote: > > > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust > > > > > > wrote: > > > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote: > > > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton > > > > > > > > wrote: > > > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields > > > > > > > > > wrote: > > > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton > > > > > > > > > > wrote: > > > > > > > > > > > Yes, saying only that it must be different is > > > > > > > > > > > intentional. > > > > > > > > > > > What > > > > > > > > > > > we > > > > > > > > > > > really want is for consumers to treat this as an > > > > > > > > > > > opaque > > > > > > > > > > > value > > > > > > > > > > > for the > > > > > > > > > > > most part [1]. Therefore an implementation based on > > > > > > > > > > > hashing > > > > > > > > > > > would > > > > > > > > > > > conform to the spec, I'd think, as long as all of > > > > > > > > > > > the > > > > > > > > > > > relevant > > > > > > > > > > > info is > > > > > > > > > > > part of the hash. > > > > > > > > > > > > > > > > > > > > It'd conform, but it might not be as useful as an > > > > > > > > > > increasing > > > > > > > > > > value. > > > > > > > > > > > > > > > > > > > > E.g. a client can use that to work out which of a > > > > > > > > > > series > > > > > > > > > > of > > > > > > > > > > reordered > > > > > > > > > > write replies is the most recent, and I seem to > > > > > > > > > > recall > > > > > > > > > > that > > > > > > > > > > can > > > > > > > > > > prevent > > > > > > > > > > unnecessary invalidations in some cases. > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's a good point; the linux client does this. That > > > > > > > > > said, > > > > > > > > > NFSv4 > > > > > > > > > has a > > > > > > > > > way for the server to advertise its change attribute > > > > > > > > > behavior > > > > > > > > > [1] > > > > > > > > > (though nfsd hasn't implemented this yet). > > > > > > > > > > > > > > > > It was implemented and reverted. The issue was that I > > > > > > > > thought > > > > > > > > nfsd > > > > > > > > should mix in the ctime to prevent the change attribute > > > > > > > > going > > > > > > > > backwards > > > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), > > > > > > > > but > > > > > > > > Trond > > > > > > > > was > > > > > > > > concerned about the possibility of time going backwards. > > > > > > > > See > > > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type > > > > > > > > attribute"". > > > > > > > > There's some mailing list discussion to that I'm not > > > > > > > > turning > > > > > > > > up > > > > > > > > right > > > > > > > > now. > > > > > > > > > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/ > > > > > > is what I was thinking of but it isn't actually that > > > > > > interesting. > > > > > > > > > > > > > My main concern was that some filesystems (e.g. ext3) were > > > > > > > failing > > > > > > > to > > > > > > > provide sufficient timestamp resolution to actually label > > > > > > > the > > > > > > > resulting > > > > > > > 'change attribute' as being updated monotonically. If the > > > > > > > time > > > > > > > stamp > > > > > > > doesn't change when the file data or metadata are changed, > > > > > > > then > > > > > > > the > > > > > > > client has to perform extra checks to try to figure out > > > > > > > whether > > > > > > > or > > > > > > > not > > > > > > > its caches are up to date. > > > > > > > > > > > > That's a different issue from the one you were raising in > > > > > > that > > > > > > discussion. > > > > > > > > > > > > > > Did NFSv4 add change_attr_type because some > > > > > > > > implementations > > > > > > > > needed > > > > > > > > the > > > > > > > > unordered case, or because they realized ordering was > > > > > > > > useful > > > > > > > > but > > > > > > > > wanted > > > > > > > > to keep backwards compatibility? I don't know which it > > > > > > > > was. > > > > > > > > > > > > > > We implemented it because, as implied above, knowledge of > > > > > > > whether > > > > > > > or > > > > > > > not the change attribute behaves monotonically, or strictly > > > > > > > monotonically, enables a number of optimisations. > > > > > > > > > > > > Of course, but my question was about the value of the old > > > > > > behavior, > > > > > > not > > > > > > about the value of the monotonic behavior. > > > > > > > > > > > > Put differently, if we could redesign the protocol from > > > > > > scratch > > > > > > would > > > > > > we > > > > > > actually have included the option of non-monotonic behavior? > > > > > > > > > > > > > > > > If we could design the filesystems from scratch, we probably > > > > > would > > > > > not. > > > > > The protocol ended up being as it is because people were trying > > > > > to > > > > > make > > > > > it as easy to implement as possible. > > > > > > > > > > So if we could design the filesystem from scratch, we would > > > > > have > > > > > probably designed it along the lines of what AFS does. > > > > > i.e. each explicit change is accompanied by a single bump of > > > > > the > > > > > change > > > > > attribute, so that the clients can not only decide the order of > > > > > the > > > > > resulting changes, but also if they have missed a change (that > > > > > might > > > > > have been made by a different client). > > > > > > > > > > However that would be a requirement that is likely to be very > > > > > specific > > > > > to distributed caches (and hence distributed filesystems). I > > > > > doubt > > > > > there are many user space applications that would need that > > > > > high > > > > > precision. Maybe MPI, but that's the only candidate I can think > > > > > of > > > > > for > > > > > now? > > > > > > > > > > > > > The fact that NFS kept this more loosely-defined is what allowed > > > > us > > > > to > > > > elide some of the i_version bumps and regain a fair bit of > > > > performance > > > > for local filesystems [1]. If the change attribute had been more > > > > strictly defined like you mention, then that particular > > > > optimization > > > > would not have been possible. > > > > > > > > This sort of thing is why I'm a fan of not defining this any more > > > > strictly than we require. Later on, maybe we'll come up with a > > > > way > > > > for > > > > filesystems to advertise that they can offer stronger guarantees. > > > > > > What 'eliding of the bumps' are we talking about here? If it > > > results in > > > unreliable behaviour, then I propose we just drop the whole concept > > > and > > > go back to using the ctime. The change attribute is only useful if > > > it > > > results in a reliable mechanism for detecting changes. Once you > > > "elide > > > away" the word "reliable", then it has no value beyond what ctime > > > already does. > > > > > > > I'm talking about the scheme to optimize away i_version updates when > > the > > current one has never been queried: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d > > > > There's nothing unreliable about it. > > Not really seeing why that would be incompatible with the idea of > bumping on every change. The I_VERSION_QUERIED is just a hint to tell > you that at the very least you need to sync the next metadata update > after someone peeked at the value. You could still continue to cache > updates after that, and only sync them once a O_SYNC or an fsync() call > explicitly requires you to do so. > Good point! It's not implemented that way today, but we could change it to do that if it were useful. I think it'd be slightly more costly CPU-wise when the update isn't going to disk, since you'd now have to update the value on every change instead of just skipping it, but I doubt anyone would notice the extra overhead.
On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote: > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote: > > The fact that NFS kept this more loosely-defined is what allowed us to > > elide some of the i_version bumps and regain a fair bit of performance > > for local filesystems [1]. If the change attribute had been more > > strictly defined like you mention, then that particular optimization > > would not have been possible. > > > > This sort of thing is why I'm a fan of not defining this any more > > strictly than we require. Later on, maybe we'll come up with a way for > > filesystems to advertise that they can offer stronger guarantees. > > Yeah, the afs change-attribute-as-counter thing seems ambitious--I > wouldn't even know how to define what exactly you're counting. > > My one question is whether it'd be worth just defining the thing as > *increasing*. That's a lower bar. > That's a very good question. One could argue that NFSv4 sort of requires that for write delegations anyway. All of the existing implementations that I know of do this, so that wouldn't rule any of them out. I'm not opposed to adding that constraint. Let me think on it a bit more. > (Though admittedly we don't quite manage it now--see again 1631087ba872 > "Revert "nfsd4: support change_attr_type attribute"".) > Factoring the ctime into the change attr seems wrong, since a clock jump could make it go backward. Do you remember what drove that change (see 630458e730b8) ? It seems like if the i_version were to go backward, then the ctime probably would too, and you'd still see a duplicate change attr.
On Tue, Aug 30, 2022 at 03:30:13PM -0400, Jeff Layton wrote: > On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote: > > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote: > > > The fact that NFS kept this more loosely-defined is what allowed us to > > > elide some of the i_version bumps and regain a fair bit of performance > > > for local filesystems [1]. If the change attribute had been more > > > strictly defined like you mention, then that particular optimization > > > would not have been possible. > > > > > > This sort of thing is why I'm a fan of not defining this any more > > > strictly than we require. Later on, maybe we'll come up with a way for > > > filesystems to advertise that they can offer stronger guarantees. > > > > Yeah, the afs change-attribute-as-counter thing seems ambitious--I > > wouldn't even know how to define what exactly you're counting. > > > > My one question is whether it'd be worth just defining the thing as > > *increasing*. That's a lower bar. > > > > That's a very good question. > > One could argue that NFSv4 sort of requires that for write delegations > anyway. All of the existing implementations that I know of do this, so > that wouldn't rule any of them out. > > I'm not opposed to adding that constraint. Let me think on it a bit > more. > > > (Though admittedly we don't quite manage it now--see again 1631087ba872 > > "Revert "nfsd4: support change_attr_type attribute"".) > > > > Factoring the ctime into the change attr seems wrong, since a clock jump > could make it go backward. Do you remember what drove that change (see > 630458e730b8) ? > > It seems like if the i_version were to go backward, then the ctime > probably would too, and you'd still see a duplicate change attr. See the comment--I was worried about crashes: the change attribute isn't on disk at the time the client requests it, so after a crash the client may see it go backward. (And then could see it repeat a value, possibly with different file contents.) Combining it with the ctime means we get something that behaves correctly even in that case--unless the clock goes backwards. --b.
On Tue, 2022-08-30 at 15:46 -0400, J. Bruce Fields wrote: > On Tue, Aug 30, 2022 at 03:30:13PM -0400, Jeff Layton wrote: > > On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote: > > > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote: > > > > The fact that NFS kept this more loosely-defined is what allowed us to > > > > elide some of the i_version bumps and regain a fair bit of performance > > > > for local filesystems [1]. If the change attribute had been more > > > > strictly defined like you mention, then that particular optimization > > > > would not have been possible. > > > > > > > > This sort of thing is why I'm a fan of not defining this any more > > > > strictly than we require. Later on, maybe we'll come up with a way for > > > > filesystems to advertise that they can offer stronger guarantees. > > > > > > Yeah, the afs change-attribute-as-counter thing seems ambitious--I > > > wouldn't even know how to define what exactly you're counting. > > > > > > My one question is whether it'd be worth just defining the thing as > > > *increasing*. That's a lower bar. > > > > > > > That's a very good question. > > > > One could argue that NFSv4 sort of requires that for write delegations > > anyway. All of the existing implementations that I know of do this, so > > that wouldn't rule any of them out. > > > > I'm not opposed to adding that constraint. Let me think on it a bit > > more. > > > > > (Though admittedly we don't quite manage it now--see again 1631087ba872 > > > "Revert "nfsd4: support change_attr_type attribute"".) > > > > > > > Factoring the ctime into the change attr seems wrong, since a clock jump > > could make it go backward. Do you remember what drove that change (see > > 630458e730b8) ? > > > > It seems like if the i_version were to go backward, then the ctime > > probably would too, and you'd still see a duplicate change attr. > > See the comment--I was worried about crashes: the change attribute isn't > on disk at the time the client requests it, so after a crash the client > may see it go backward. (And then could see it repeat a value, possibly > with different file contents.) > > Combining it with the ctime means we get something that behaves > correctly even in that case--unless the clock goes backwards. > Yeah ok, I vaguely remember discussing this. That seems like it has its own problem though. If you mix in the ctime and the clock jumps backward, then you could end up with the same issue (a stale changeid, different contents). No crash required.
On Tue, Aug 30, 2022 at 03:57:09PM -0400, Jeff Layton wrote: > On Tue, 2022-08-30 at 15:46 -0400, J. Bruce Fields wrote: > > On Tue, Aug 30, 2022 at 03:30:13PM -0400, Jeff Layton wrote: > > > On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote: > > > > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote: > > > > > The fact that NFS kept this more loosely-defined is what allowed us to > > > > > elide some of the i_version bumps and regain a fair bit of performance > > > > > for local filesystems [1]. If the change attribute had been more > > > > > strictly defined like you mention, then that particular optimization > > > > > would not have been possible. > > > > > > > > > > This sort of thing is why I'm a fan of not defining this any more > > > > > strictly than we require. Later on, maybe we'll come up with a way for > > > > > filesystems to advertise that they can offer stronger guarantees. > > > > > > > > Yeah, the afs change-attribute-as-counter thing seems ambitious--I > > > > wouldn't even know how to define what exactly you're counting. > > > > > > > > My one question is whether it'd be worth just defining the thing as > > > > *increasing*. That's a lower bar. > > > > > > > > > > That's a very good question. > > > > > > One could argue that NFSv4 sort of requires that for write delegations > > > anyway. All of the existing implementations that I know of do this, so > > > that wouldn't rule any of them out. > > > > > > I'm not opposed to adding that constraint. Let me think on it a bit > > > more. > > > > > > > (Though admittedly we don't quite manage it now--see again 1631087ba872 > > > > "Revert "nfsd4: support change_attr_type attribute"".) > > > > > > > > > > Factoring the ctime into the change attr seems wrong, since a clock jump > > > could make it go backward. Do you remember what drove that change (see > > > 630458e730b8) ? > > > > > > It seems like if the i_version were to go backward, then the ctime > > > probably would too, and you'd still see a duplicate change attr. > > > > See the comment--I was worried about crashes: the change attribute isn't > > on disk at the time the client requests it, so after a crash the client > > may see it go backward. (And then could see it repeat a value, possibly > > with different file contents.) > > > > Combining it with the ctime means we get something that behaves > > correctly even in that case--unless the clock goes backwards. > > > > Yeah ok, I vaguely remember discussing this. > > That seems like it has its own problem though. If you mix in the ctime > and the clock jumps backward, then you could end up with the same issue > (a stale changeid, different contents). No crash required. Yes, exactly. My feeling was that I've got no control over power failures and such, but the clock setting is something I can get right. So I'd rather have something that depended on the latter than the former. I could be wrong. --b.
diff --git a/include/linux/iversion.h b/include/linux/iversion.h index 3bfebde5a1a6..45e93e1b4edc 100644 --- a/include/linux/iversion.h +++ b/include/linux/iversion.h @@ -9,8 +9,19 @@ * --------------------------- * The change attribute (i_version) is mandated by NFSv4 and is mostly for * knfsd, but is also used for other purposes (e.g. IMA). The i_version must - * appear different to observers if there was a change to the inode's data or - * metadata since it was last queried. + * appear different to observers if there was an explicit change to the inode's + * data or metadata since it was last queried. + * + * An explicit change is one that would ordinarily result in a change to the + * inode status change time (aka ctime). The version must appear to change, even + * if the ctime does not (since the whole point is to avoid missing updates due + * to timestamp granularity). If POSIX mandates that the ctime must change due + * to an operation, then the i_version counter must be incremented as well. + * + * A conformant implementation is allowed to increment the counter in other + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine + * whether caches are up to date. Spurious increments can cause false cache + * invalidations. * * Observers see the i_version as a 64-bit number that never decreases. If it * remains the same since it was last checked, then nothing has changed in the @@ -66,6 +77,14 @@ * Storing the value to disk therefore does not count as a query, so those * filesystems should use inode_peek_iversion to grab the value to be stored. * There is no need to flag the value as having been queried in that case. + * + * Notes on atime updates + * ---------------------- + * Access time (atime) updates due to reads or similar activity do not represent + * an explicit change to the inode data or metadata. If the only change to the + * inode is the atime, then i_version should not be incremented. If an observer + * cares about atime updates, it should plan to fetch and store the atime in + * conjunction with the i_version. */ /*
The i_version field in the kernel has had different semantics over the decades, but we're now proposing to expose it to userland via statx. This means that we need a clear, consistent definition of what it means and when it should change. Update the comments in iversion.h to describe how a conformant i_version implementation is expected to behave. This definition suits the current users of i_version (NFSv4 and IMA), but is loose enough to allow for a wide range of possible implementations. Cc: Colin Walters <walters@verbum.org> Cc: NeilBrown <neilb@suse.de> Cc: Trond Myklebust <trondmy@hammerspace.com> Cc: Dave Chinner <david@fromorbit.com> Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t Signed-off-by: Jeff Layton <jlayton@kernel.org> --- include/linux/iversion.h | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-)