Message ID | 20220907111606.18831-1-jlayton@kernel.org |
---|---|
State | New, archived |
Series | [man-pages,RFC,v4] statx, inode: document the new STATX_INO_VERSION field |
On Wed, 07 Sep 2022, Jeff Layton wrote: > I'm proposing to expose the inode change attribute via statx [1]. Document > what this value means and what an observer can infer from it changing. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > [1]: https://lore.kernel.org/linux-nfs/20220826214703.134870-1-jlayton@kernel.org/T/#t > --- > man2/statx.2 | 8 ++++++++ > man7/inode.7 | 39 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 47 insertions(+) > > v4: add paragraph pointing out the lack of atomicity wrt other changes > > I think these patches are racing with another change to add DIO > alignment info to statx. I imagine this will go in after that, so this > will probably need to be respun to account for contextual differences. > > What I'm mostly interested in here is getting the sematics and > description of the i_version counter nailed down. > > diff --git a/man2/statx.2 b/man2/statx.2 > index 0d1b4591f74c..d98d5148a442 100644 > --- a/man2/statx.2 > +++ b/man2/statx.2 > @@ -62,6 +62,7 @@ struct statx { > __u32 stx_dev_major; /* Major ID */ > __u32 stx_dev_minor; /* Minor ID */ > __u64 stx_mnt_id; /* Mount ID */ > + __u64 stx_ino_version; /* Inode change attribute */ > }; > .EE > .in > @@ -247,6 +248,7 @@ STATX_BTIME Want stx_btime > STATX_ALL The same as STATX_BASIC_STATS | STATX_BTIME. > It is deprecated and should not be used. > STATX_MNT_ID Want stx_mnt_id (since Linux 5.8) > +STATX_INO_VERSION Want stx_ino_version (DRAFT) > .TE > .in > .PP > @@ -407,10 +409,16 @@ This is the same number reported by > .BR name_to_handle_at (2) > and corresponds to the number in the first field in one of the records in > .IR /proc/self/mountinfo . > +.TP > +.I stx_ino_version > +The inode version, also known as the inode change attribute. See > +.BR inode (7) > +for details. > .PP > For further information on the above fields, see > .BR inode (7). > .\" > +.TP > .SS File attributes > The > .I stx_attributes > diff --git a/man7/inode.7 b/man7/inode.7 > index 9b255a890720..8e83836594d8 100644 > --- a/man7/inode.7 > +++ b/man7/inode.7 > @@ -184,6 +184,12 @@ Last status change timestamp (ctime) > This is the file's last status change timestamp. > It is changed by writing or by setting inode information > (i.e., owner, group, link count, mode, etc.). > +.TP > +Inode version (i_version) > +(not returned in the \fIstat\fP structure); \fIstatx.stx_ino_version\fP > +.IP > +This is the inode change counter. See the discussion of > +\fBthe inode version counter\fP, below. > .PP > The timestamp fields report time measured with a zero point at the > .IR Epoch , > @@ -424,6 +430,39 @@ on a directory means that a file > in that directory can be renamed or deleted only by the owner > of the file, by the owner of the directory, and by a privileged > process. > +.SS The inode version counter > +.PP > +The > +.I statx.stx_ino_version > +field is the inode change counter. Any operation that would result in a > +change to \fIstatx.stx_ctime\fP must result in an increase to this value. > +The value must increase even in the case where the ctime change is not > +evident due to coarse timestamp granularity. > +.PP > +An observer cannot infer anything from amount of increase about the > +nature or magnitude of the change. If the returned value is different > +from the last time it was checked, then something has made an explicit > +data and/or metadata change to the inode. > +.PP > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > +other changes in the inode. 
On a write, for instance, the i_version it usually > +incremented before the data is copied into the pagecache. Therefore it is > +possible to see a new i_version value while a read still shows the old data. Doesn't that make the value useless? Surely the change number must change no sooner than the change itself is visible, otherwise stale data could be cached indefinitely. If current implementations behave this way, surely they are broken. NeilBrown
On Wed, Sep 07, 2022 at 09:37:33PM +1000, NeilBrown wrote: > On Wed, 07 Sep 2022, Jeff Layton wrote: > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > +other changes in the inode. On a write, for instance, the i_version it usually > > +incremented before the data is copied into the pagecache. Therefore it is > > +possible to see a new i_version value while a read still shows the old data. > > Doesn't that make the value useless? Surely the change number must > change no sooner than the change itself is visible, otherwise stale data > could be cached indefinitely. For the purposes of NFS close-to-open, I guess all we need is for the change attribute increment to happen sometime between the open and the close. But, yes, it'd seem a lot more useful if it was guaranteed to happen after. (Or before and after both--extraneous increments aren't a big problem here.) --b. > > If currently implementations behave this way, surely they are broken. > > NeilBrown
On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > On Wed, 07 Sep 2022, Jeff Layton wrote: > > I'm proposing to expose the inode change attribute via statx [1]. Document > > what this value means and what an observer can infer from it changing. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > [1]: https://lore.kernel.org/linux-nfs/20220826214703.134870-1-jlayton@kernel.org/T/#t > > --- > > man2/statx.2 | 8 ++++++++ > > man7/inode.7 | 39 +++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 47 insertions(+) > > > > v4: add paragraph pointing out the lack of atomicity wrt other changes > > > > I think these patches are racing with another change to add DIO > > alignment info to statx. I imagine this will go in after that, so this > > will probably need to be respun to account for contextual differences. > > > > What I'm mostly interested in here is getting the sematics and > > description of the i_version counter nailed down. > > > > diff --git a/man2/statx.2 b/man2/statx.2 > > index 0d1b4591f74c..d98d5148a442 100644 > > --- a/man2/statx.2 > > +++ b/man2/statx.2 > > @@ -62,6 +62,7 @@ struct statx { > > __u32 stx_dev_major; /* Major ID */ > > __u32 stx_dev_minor; /* Minor ID */ > > __u64 stx_mnt_id; /* Mount ID */ > > + __u64 stx_ino_version; /* Inode change attribute */ > > }; > > .EE > > .in > > @@ -247,6 +248,7 @@ STATX_BTIME Want stx_btime > > STATX_ALL The same as STATX_BASIC_STATS | STATX_BTIME. > > It is deprecated and should not be used. > > STATX_MNT_ID Want stx_mnt_id (since Linux 5.8) > > +STATX_INO_VERSION Want stx_ino_version (DRAFT) > > .TE > > .in > > .PP > > @@ -407,10 +409,16 @@ This is the same number reported by > > .BR name_to_handle_at (2) > > and corresponds to the number in the first field in one of the records in > > .IR /proc/self/mountinfo . > > +.TP > > +.I stx_ino_version > > +The inode version, also known as the inode change attribute. See > > +.BR inode (7) > > +for details. > > .PP > > For further information on the above fields, see > > .BR inode (7). > > .\" > > +.TP > > .SS File attributes > > The > > .I stx_attributes > > diff --git a/man7/inode.7 b/man7/inode.7 > > index 9b255a890720..8e83836594d8 100644 > > --- a/man7/inode.7 > > +++ b/man7/inode.7 > > @@ -184,6 +184,12 @@ Last status change timestamp (ctime) > > This is the file's last status change timestamp. > > It is changed by writing or by setting inode information > > (i.e., owner, group, link count, mode, etc.). > > +.TP > > +Inode version (i_version) > > +(not returned in the \fIstat\fP structure); \fIstatx.stx_ino_version\fP > > +.IP > > +This is the inode change counter. See the discussion of > > +\fBthe inode version counter\fP, below. > > .PP > > The timestamp fields report time measured with a zero point at the > > .IR Epoch , > > @@ -424,6 +430,39 @@ on a directory means that a file > > in that directory can be renamed or deleted only by the owner > > of the file, by the owner of the directory, and by a privileged > > process. > > +.SS The inode version counter > > +.PP > > +The > > +.I statx.stx_ino_version > > +field is the inode change counter. Any operation that would result in a > > +change to \fIstatx.stx_ctime\fP must result in an increase to this value. > > +The value must increase even in the case where the ctime change is not > > +evident due to coarse timestamp granularity. > > +.PP > > +An observer cannot infer anything from amount of increase about the > > +nature or magnitude of the change. 
If the returned value is different > > +from the last time it was checked, then something has made an explicit > > +data and/or metadata change to the inode. > > +.PP > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > +other changes in the inode. On a write, for instance, the i_version it usually > > +incremented before the data is copied into the pagecache. Therefore it is > > +possible to see a new i_version value while a read still shows the old data. > > Doesn't that make the value useless? > No, I don't think so. It's only really useful for comparing to an older sample anyway. If you do "statx; read; statx" and the value hasn't changed, then you know that things are stable. > Surely the change number must > change no sooner than the change itself is visible, otherwise stale data > could be cached indefinitely. > > If currently implementations behave this way, surely they are broken. It's certainly not ideal but we've never been able to offer truly atomic behavior here given that Linux is a general-purpose OS. The behavior is a little inconsistent too: The c/mtime update and i_version bump on directories (mostly) occur after the operation. c/mtime updates for files however are mostly driven by calls to file_update_time, which happens before data is copied to the pagecache. It's not clear to me why it's done this way. Maybe to ensure that the metadata is up to date in the event that a statx comes in? Improving this would be nice, but I don't see a way to do that without regressing performance.
On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > +incremented before the data is copied into the pagecache. Therefore it is > > > +possible to see a new i_version value while a read still shows the old data. > > > > Doesn't that make the value useless? > > > > No, I don't think so. It's only really useful for comparing to an older > sample anyway. If you do "statx; read; statx" and the value hasn't > changed, then you know that things are stable. I don't see how that helps. It's still possible to get: reader writer ------ ------ i_version++ statx read statx update page cache right? --b. > > > Surely the change number must > > change no sooner than the change itself is visible, otherwise stale data > > could be cached indefinitely. > > > > If currently implementations behave this way, surely they are broken. > > It's certainly not ideal but we've never been able to offer truly atomic > behavior here given that Linux is a general-purpose OS. The behavior is > a little inconsistent too: > > The c/mtime update and i_version bump on directories (mostly) occur > after the operation. c/mtime updates for files however are mostly driven > by calls to file_update_time, which happens before data is copied to the > pagecache. > > It's not clear to me why it's done this way. Maybe to ensure that the > metadata is up to date in the event that a statx comes in? Improving > this would be nice, but I don't see a way to do that without regressing > performance. > -- > Jeff Layton <jlayton@kernel.org>
On Wed, 2022-09-07 at 08:20 -0400, J. Bruce Fields wrote: > On Wed, Sep 07, 2022 at 09:37:33PM +1000, NeilBrown wrote: > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > +incremented before the data is copied into the pagecache. Therefore it is > > > +possible to see a new i_version value while a read still shows the old data. > > > > Doesn't that make the value useless? Surely the change number must > > change no sooner than the change itself is visible, otherwise stale data > > could be cached indefinitely. > > For the purposes of NFS close-to-open, I guess all we need is for the > change attribute increment to happen sometime between the open and the > close. > > But, yes, it'd seem a lot more useful if it was guaranteed to happen > after. (Or before and after both--extraneous increments aren't a big > problem here.) > > For NFS I don't think they would be. We don't want increments due to reads that may happen well after the initial write, but as long as the second increment comes in fairly soon after the initial one, the extra invalidations shouldn't be _too_ bad. You might have a reader race in and see the interim value, but we'd probably want the reader to invalidate the cache soon after that anyway. The file was clearly in flux at the time of the read. Allowing for this sort of thing is why I've been advocating against trying to define this value too strictly. If we were to confine ourselves to "one bump per change" then it'd be hard to pull this off. Maybe this is what we should be doing? > > > > If currently implementations behave this way, surely they are broken. > >
On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > > +incremented before the data is copied into the pagecache. Therefore it is > > > > +possible to see a new i_version value while a read still shows the old data. > > > > > > Doesn't that make the value useless? > > > > > > > No, I don't think so. It's only really useful for comparing to an older > > sample anyway. If you do "statx; read; statx" and the value hasn't > > changed, then you know that things are stable. > > I don't see how that helps. It's still possible to get: > > reader writer > ------ ------ > i_version++ > statx > read > statx > update page cache > > right? > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In that case, maybe this is useless then other than for testing purposes and userland NFS servers. Would it be better to not consume a statx field with this if so? What could we use as an alternate interface? ioctl? Some sort of global virtual xattr? It does need to be something per-inode. > > > > > Surely the change number must > > > change no sooner than the change itself is visible, otherwise stale data > > > could be cached indefinitely. > > > > > > If currently implementations behave this way, surely they are broken. > > > > It's certainly not ideal but we've never been able to offer truly atomic > > behavior here given that Linux is a general-purpose OS. The behavior is > > a little inconsistent too: > > > > The c/mtime update and i_version bump on directories (mostly) occur > > after the operation. c/mtime updates for files however are mostly driven > > by calls to file_update_time, which happens before data is copied to the > > pagecache. > > > > It's not clear to me why it's done this way. Maybe to ensure that the > > metadata is up to date in the event that a statx comes in? Improving > > this would be nice, but I don't see a way to do that without regressing > > performance. > > -- > > Jeff Layton <jlayton@kernel.org>
On Wed 07-09-22 09:12:34, Jeff Layton wrote: > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > > > +incremented before the data is copied into the pagecache. Therefore it is > > > > > +possible to see a new i_version value while a read still shows the old data. > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an older > > > sample anyway. If you do "statx; read; statx" and the value hasn't > > > changed, then you know that things are stable. > > > > I don't see how that helps. It's still possible to get: > > > > reader writer > > ------ ------ > > i_version++ > > statx > > read > > statx > > update page cache > > > > right? > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > that case, maybe this is useless then other than for testing purposes > and userland NFS servers. > > Would it be better to not consume a statx field with this if so? What > could we use as an alternate interface? ioctl? Some sort of global > virtual xattr? It does need to be something per-inode. I was thinking how hard would it be to increment i_version after updating data but it will be rather hairy. In particular because of stuff like IOCB_NOWAIT support which needs to bail if i_version update is needed. So yeah, I don't think there's an easy way how to provide useful i_version for general purpose use. Honza
On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > respect to the > > > > > +other changes in the inode. On a write, for instance, the > > > > > i_version it usually > > > > > +incremented before the data is copied into the pagecache. > > > > > Therefore it is > > > > > +possible to see a new i_version value while a read still > > > > > shows the old data. > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > older > > > sample anyway. If you do "statx; read; statx" and the value > > > hasn't > > > changed, then you know that things are stable. > > > > I don't see how that helps. It's still possible to get: > > > > reader writer > > ------ ------ > > i_version++ > > statx > > read > > statx > > update page cache > > > > right? > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > that case, maybe this is useless then other than for testing purposes > and userland NFS servers. > > Would it be better to not consume a statx field with this if so? What > could we use as an alternate interface? ioctl? Some sort of global > virtual xattr? It does need to be something per-inode. I don't see how a non-atomic change attribute is remotely useful even for NFS. The main problem is not so much the above (although NFS clients are vulnerable to that too) but the behaviour w.r.t. directory changes. If the server can't guarantee that file/directory/... creation and unlink are atomically recorded with change attribute updates, then the client has to always assume that the server is lying, and that it has to revalidate all its caches anyway. Cue endless readdir/lookup/getattr requests after each and every directory modification in order to check that some other client didn't also sneak in a change of their own.
On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > respect to the > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > i_version it usually > > > > > > +incremented before the data is copied into the pagecache. > > > > > > Therefore it is > > > > > > +possible to see a new i_version value while a read still > > > > > > shows the old data. > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > older > > > > sample anyway. If you do "statx; read; statx" and the value > > > > hasn't > > > > changed, then you know that things are stable. > > > > > > I don't see how that helps. It's still possible to get: > > > > > > reader writer > > > ------ ------ > > > i_version++ > > > statx > > > read > > > statx > > > update page cache > > > > > > right? > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > that case, maybe this is useless then other than for testing purposes > > and userland NFS servers. > > > > Would it be better to not consume a statx field with this if so? What > > could we use as an alternate interface? ioctl? Some sort of global > > virtual xattr? It does need to be something per-inode. > > I don't see how a non-atomic change attribute is remotely useful even > for NFS. > > The main problem is not so much the above (although NFS clients are > vulnerable to that too) but the behaviour w.r.t. directory changes. > > If the server can't guarantee that file/directory/... creation and > unlink are atomically recorded with change attribute updates, then the > client has to always assume that the server is lying, and that it has > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > requests after each and every directory modification in order to check > that some other client didn't also sneak in a change of their own. > We generally hold the parent dir's inode->i_rwsem exclusively over most important directory changes, and the times/i_version are also updated while holding it. What we don't do is serialize reads of this value vs. the i_rwsem, so you could see new directory contents alongside an old i_version. Maybe we should be taking it for read when we query it on a directory? Achieving atomicity with file writes though is another matter entirely. I'm not sure that's even doable or how to approach it if so. Suggestions?
On Wed, 2022-09-07 at 15:51 +0200, Jan Kara wrote: > On Wed 07-09-22 09:12:34, Jeff Layton wrote: > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > > > > +incremented before the data is copied into the pagecache. Therefore it is > > > > > > +possible to see a new i_version value while a read still shows the old data. > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an older > > > > sample anyway. If you do "statx; read; statx" and the value hasn't > > > > changed, then you know that things are stable. > > > > > > I don't see how that helps. It's still possible to get: > > > > > > reader writer > > > ------ ------ > > > i_version++ > > > statx > > > read > > > statx > > > update page cache > > > > > > right? > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > that case, maybe this is useless then other than for testing purposes > > and userland NFS servers. > > > > Would it be better to not consume a statx field with this if so? What > > could we use as an alternate interface? ioctl? Some sort of global > > virtual xattr? It does need to be something per-inode. > > I was thinking how hard would it be to increment i_version after updating > data but it will be rather hairy. In particular because of stuff like > IOCB_NOWAIT support which needs to bail if i_version update is needed. So > yeah, I don't think there's an easy way how to provide useful i_version for > general purpose use. > Yeah, it does look ugly. Another idea might be to just take the i_rwsem for read in the statx codepath when STATX_INO_VERSION has been requested. xfs, ext4 and btrfs hold the i_rwsem exclusively over their buffered write ops. Doing that should be enough to prevent the race above, I think. The ext4 DAX path also looks ok there. The ext4 DIO write implementation seems to take the i_rwsem for read though unless the size is changing or the write is unaligned. So a i_rwsem readlock would probably not be enough to guard against changes there. Maybe we can just say if you're doing DIO, then don't expect real atomicity wrt i_version? knfsd seems to already hold i_rwsem when doing directory morphing operations (where it fetches the pre and post attrs), but it doesn't take it when calling nfsd4_encode_fattr (which is used to fill out GETATTR and READDIR replies, etc.). We'd probably have to start taking it in those codepaths too. We should also bear in mind that from userland, doing a read of a normal file and fetching the i_version takes two different syscalls. I'm not sure we need things to be truly "atomic", per-se. Whether and how we can exploit that fact, I'm not sure.
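A rough sketch of the locking idea above, not actual kernel code: the STATX_INO_VERSION bit and the kstat field are hypothetical (taken from this draft series), while inode_lock_shared(), IS_I_VERSION() and inode_query_iversion() are existing helpers.

    #include <linux/fs.h>
    #include <linux/iversion.h>
    #include <linux/stat.h>

    /* Sketch only: sample i_version under i_rwsem (shared) when the caller
     * asked for it, so the sample cannot interleave with a buffered write
     * that holds i_rwsem exclusive. */
    static void statx_fill_ino_version(struct inode *inode, struct kstat *stat,
                                       u32 request_mask)
    {
            if (!(request_mask & STATX_INO_VERSION) || !IS_I_VERSION(inode))
                    return;

            inode_lock_shared(inode);
            stat->ino_version = inode_query_iversion(inode);  /* hypothetical kstat field */
            inode_unlock_shared(inode);
            stat->result_mask |= STATX_INO_VERSION;
    }

As noted above, DIO writers that take i_rwsem only for read would still race with this, and knfsd's GETATTR/READDIR encoding paths would need the same treatment.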
On Wed, 2022-09-07 at 10:05 -0400, Jeff Layton wrote: > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic > > > > > > > with > > > > > > > respect to the > > > > > > > +other changes in the inode. On a write, for instance, > > > > > > > the > > > > > > > i_version it usually > > > > > > > +incremented before the data is copied into the > > > > > > > pagecache. > > > > > > > Therefore it is > > > > > > > +possible to see a new i_version value while a read still > > > > > > > shows the old data. > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing > > > > > to an > > > > > older > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > hasn't > > > > > changed, then you know that things are stable. > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > reader writer > > > > ------ ------ > > > > i_version++ > > > > statx > > > > read > > > > statx > > > > update page cache > > > > > > > > right? > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. > > > In > > > that case, maybe this is useless then other than for testing > > > purposes > > > and userland NFS servers. > > > > > > Would it be better to not consume a statx field with this if so? > > > What > > > could we use as an alternate interface? ioctl? Some sort of > > > global > > > virtual xattr? It does need to be something per-inode. > > > > I don't see how a non-atomic change attribute is remotely useful > > even > > for NFS. > > > > The main problem is not so much the above (although NFS clients are > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > If the server can't guarantee that file/directory/... creation and > > unlink are atomically recorded with change attribute updates, then > > the > > client has to always assume that the server is lying, and that it > > has > > to revalidate all its caches anyway. Cue endless > > readdir/lookup/getattr > > requests after each and every directory modification in order to > > check > > that some other client didn't also sneak in a change of their own. > > > > We generally hold the parent dir's inode->i_rwsem exclusively over > most > important directory changes, and the times/i_version are also updated > while holding it. What we don't do is serialize reads of this value > vs. > the i_rwsem, so you could see new directory contents alongside an old > i_version. Maybe we should be taking it for read when we query it on > a > directory? Serialising reads is not the problem. The problem is ensuring that knfsd is able to provide an atomic change_info4 structure when the client modifies the directory. i.e. the requirement is that if the directory changed, then that modification is atomically accompanied by an update of the change attribute that can be retrieved by knfsd and placed in the reply to the client. > Achieving atomicity with file writes though is another matter > entirely. > I'm not sure that's even doable or how to approach it if so. > Suggestions? 
The problem outlined by Bruce above isn't a big deal. Just check the I_VERSION_QUERIED flag after the 'update_page_cache' bit, and bump the i_version if that's the case. The real problem is what happens if you then crash during writeback...
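A minimal sketch of the idea above, assuming it were wired into a generic buffered write path; the iversion helpers are the ones from include/linux/iversion.h, but the surrounding flow is simplified and exact signatures vary by kernel version, so treat this as illustration only.

    #include <linux/fs.h>
    #include <linux/iversion.h>
    #include <linux/uio.h>

    /* Sketch of a buffered write that re-checks the "queried" flag after the
     * data lands in the page cache; not any filesystem's actual write path. */
    static ssize_t sketch_buffered_write(struct kiocb *iocb, struct iov_iter *from)
    {
            struct inode *inode = file_inode(iocb->ki_filp);
            ssize_t ret;

            /* today: ctime/i_version are bumped up front, before the copy */
            ret = file_update_time(iocb->ki_filp);
            if (ret)
                    return ret;

            ret = generic_perform_write(iocb, from);  /* data copied into the page cache */

            /*
             * Trond's point: if someone sampled i_version between the bump
             * above and the copy, bump it again so a later re-check sees a
             * new value.
             */
            if (ret > 0 && inode_iversion_need_inc(inode))
                    inode_maybe_inc_iversion(inode, true);

            return ret;
    }

The replies below spell out the complications: the post-copy update can fail where the pre-copy one succeeded, IOCB_NOWAIT paths cannot afford to dirty the inode at that point, and a crash before writeback can still lose the bump.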
On Wed, 2022-09-07 at 15:04 +0000, Trond Myklebust wrote: > On Wed, 2022-09-07 at 10:05 -0400, Jeff Layton wrote: > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic > > > > > > > > with > > > > > > > > respect to the > > > > > > > > +other changes in the inode. On a write, for instance, > > > > > > > > the > > > > > > > > i_version it usually > > > > > > > > +incremented before the data is copied into the > > > > > > > > pagecache. > > > > > > > > Therefore it is > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > shows the old data. > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing > > > > > > to an > > > > > > older > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > hasn't > > > > > > changed, then you know that things are stable. > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > reader writer > > > > > ------ ------ > > > > > i_version++ > > > > > statx > > > > > read > > > > > statx > > > > > update page cache > > > > > > > > > > right? > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. > > > > In > > > > that case, maybe this is useless then other than for testing > > > > purposes > > > > and userland NFS servers. > > > > > > > > Would it be better to not consume a statx field with this if so? > > > > What > > > > could we use as an alternate interface? ioctl? Some sort of > > > > global > > > > virtual xattr? It does need to be something per-inode. > > > > > > I don't see how a non-atomic change attribute is remotely useful > > > even > > > for NFS. > > > > > > The main problem is not so much the above (although NFS clients are > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > If the server can't guarantee that file/directory/... creation and > > > unlink are atomically recorded with change attribute updates, then > > > the > > > client has to always assume that the server is lying, and that it > > > has > > > to revalidate all its caches anyway. Cue endless > > > readdir/lookup/getattr > > > requests after each and every directory modification in order to > > > check > > > that some other client didn't also sneak in a change of their own. > > > > > > > We generally hold the parent dir's inode->i_rwsem exclusively over > > most > > important directory changes, and the times/i_version are also updated > > while holding it. What we don't do is serialize reads of this value > > vs. > > the i_rwsem, so you could see new directory contents alongside an old > > i_version. Maybe we should be taking it for read when we query it on > > a > > directory? > > Serialising reads is not the problem. The problem is ensuring that > knfsd is able to provide an atomic change_info4 structure when the > client modifies the directory. > i.e. 
the requirement is that if the directory changed, then that > modification is atomically accompanied by an update of the change > attribute that can be retrieved by knfsd and placed in the reply to the > client. > I think we already do that for directories today via the i_rwsem. We hold that exclusively over directory-morphing operations, and the i_version is updated while holding that lock. > > Achieving atomicity with file writes though is another matter > > entirely. > > I'm not sure that's even doable or how to approach it if so. > > Suggestions? > > The problem outlined by Bruce above isn't a big deal. Just check the > I_VERSION_QUERIED flag after the 'update_page_cache' bit, and bump the > i_version if that's the case. The real problem is what happens if you > then crash during writeback... > It's uglier than it looks at first glance. As Jan pointed out, it's possible for the initial file_modified call to succeed and then a second one to fail. If the time got an initial update and then the data was copied in, should we fail the write at that point? We may be better served by trying to also do this with the i_rwsem. I'm looking at that now, though it's a bit hairy given that vfs_getattr_nosec can be called either with or without it held.
On Wed, 07 Sep 2022, Trond Myklebust wrote: > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > respect to the > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > i_version it usually > > > > > > +incremented before the data is copied into the pagecache. > > > > > > Therefore it is > > > > > > +possible to see a new i_version value while a read still > > > > > > shows the old data. > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > older > > > > sample anyway. If you do "statx; read; statx" and the value > > > > hasn't > > > > changed, then you know that things are stable. > > > > > > I don't see how that helps. It's still possible to get: > > > > > > reader writer > > > ------ ------ > > > i_version++ > > > statx > > > read > > > statx > > > update page cache > > > > > > right? > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > that case, maybe this is useless then other than for testing purposes > > and userland NFS servers. > > > > Would it be better to not consume a statx field with this if so? What > > could we use as an alternate interface? ioctl? Some sort of global > > virtual xattr? It does need to be something per-inode. > > I don't see how a non-atomic change attribute is remotely useful even > for NFS. > > The main problem is not so much the above (although NFS clients are > vulnerable to that too) but the behaviour w.r.t. directory changes. > > If the server can't guarantee that file/directory/... creation and > unlink are atomically recorded with change attribute updates, then the > client has to always assume that the server is lying, and that it has > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > requests after each and every directory modification in order to check > that some other client didn't also sneak in a change of their own. NFS re-export doesn't support atomic change attributes on directories. Do we see the endless revalidate requests after directory modification in that situation? Just curious. Thanks, NeilBrown
On Thu, 08 Sep 2022, Jeff Layton wrote: > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > > respect to the > > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > > i_version it usually > > > > > > > +incremented before the data is copied into the pagecache. > > > > > > > Therefore it is > > > > > > > +possible to see a new i_version value while a read still > > > > > > > shows the old data. > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > > older > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > hasn't > > > > > changed, then you know that things are stable. > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > reader writer > > > > ------ ------ > > > > i_version++ > > > > statx > > > > read > > > > statx > > > > update page cache > > > > > > > > right? > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > > that case, maybe this is useless then other than for testing purposes > > > and userland NFS servers. > > > > > > Would it be better to not consume a statx field with this if so? What > > > could we use as an alternate interface? ioctl? Some sort of global > > > virtual xattr? It does need to be something per-inode. > > > > I don't see how a non-atomic change attribute is remotely useful even > > for NFS. > > > > The main problem is not so much the above (although NFS clients are > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > If the server can't guarantee that file/directory/... creation and > > unlink are atomically recorded with change attribute updates, then the > > client has to always assume that the server is lying, and that it has > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > > requests after each and every directory modification in order to check > > that some other client didn't also sneak in a change of their own. > > > > We generally hold the parent dir's inode->i_rwsem exclusively over most > important directory changes, and the times/i_version are also updated > while holding it. What we don't do is serialize reads of this value vs. > the i_rwsem, so you could see new directory contents alongside an old > i_version. Maybe we should be taking it for read when we query it on a > directory? We do hold i_rwsem today. I'm working on changing that. Preserving atomic directory changeinfo will be a challenge. The only mechanism I can think if is to pass a "u64*" to all the directory modification ops, and they fill in the version number at the point where it is incremented (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that "before" was one less than "after". If you don't want to internally require single increments, then you would need to pass a 'u64 [2]' to get two iversions back. > > Achieving atomicity with file writes though is another matter entirely. > I'm not sure that's even doable or how to approach it if so. > Suggestions? 
Call inode_maybe_inc_version(page->host) in __folio_mark_dirty() ?? NeilBrown
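To make the u64* suggestion above concrete, here is a purely hypothetical sketch: the cookie structure, the extra argument on the directory op, and inode_maybe_inc_iversion_return() are names taken from this discussion or invented for illustration, not existing kernel API.

    #include <linux/dcache.h>
    #include <linux/fs.h>

    struct dir_change {
            u64 after;              /* i_version value installed by the op */
    };

    /* Hypothetical extra argument on ->create(), ->unlink(), ->rename(), ...:
     * the op reports the i_version it installed so nfsd can build an atomic
     * change_info4 without holding i_rwsem across reply encoding. */
    static int example_create(struct inode *dir, struct dentry *dentry,
                              umode_t mode, struct dir_change *dc)
    {
            int err = do_create(dir, dentry, mode); /* placeholder for the real work */

            if (err)
                    return err;
            /* inode_maybe_inc_iversion_return() is the helper named above; it
             * would increment i_version and hand back the new value atomically. */
            dc->after = inode_maybe_inc_iversion_return(dir);
            return 0;
    }

    /* nfsd side: with single increments guaranteed, "before" is implied. */
    static void fill_change_info(const struct dir_change *dc, u64 *before, u64 *after)
    {
            *after = dc->after;
            *before = dc->after - 1;
    }

As the follow-ups note, this amounts to a redesign of how change information flows out of the VFS, and a two-value variant would be needed if single increments are not guaranteed.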
On Thu, 2022-09-08 at 10:31 +1000, NeilBrown wrote: > On Wed, 07 Sep 2022, Trond Myklebust wrote: > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic > > > > > > > with > > > > > > > respect to the > > > > > > > +other changes in the inode. On a write, for instance, > > > > > > > the > > > > > > > i_version it usually > > > > > > > +incremented before the data is copied into the > > > > > > > pagecache. > > > > > > > Therefore it is > > > > > > > +possible to see a new i_version value while a read still > > > > > > > shows the old data. > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing > > > > > to an > > > > > older > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > hasn't > > > > > changed, then you know that things are stable. > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > reader writer > > > > ------ ------ > > > > i_version++ > > > > statx > > > > read > > > > statx > > > > update page cache > > > > > > > > right? > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. > > > In > > > that case, maybe this is useless then other than for testing > > > purposes > > > and userland NFS servers. > > > > > > Would it be better to not consume a statx field with this if so? > > > What > > > could we use as an alternate interface? ioctl? Some sort of > > > global > > > virtual xattr? It does need to be something per-inode. > > > > I don't see how a non-atomic change attribute is remotely useful > > even > > for NFS. > > > > The main problem is not so much the above (although NFS clients are > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > If the server can't guarantee that file/directory/... creation and > > unlink are atomically recorded with change attribute updates, then > > the > > client has to always assume that the server is lying, and that it > > has > > to revalidate all its caches anyway. Cue endless > > readdir/lookup/getattr > > requests after each and every directory modification in order to > > check > > that some other client didn't also sneak in a change of their own. > > NFS re-export doesn't support atomic change attributes on > directories. > Do we see the endless revalidate requests after directory > modification > in that situation? Just curious. Why wouldn't NFS re-export be capable of supporting atomic change attributes in those cases, provided that the server does? It seems to me that is just a question of providing the correct information w.r.t. atomicity to knfsd. ...but yes, a quick glance at nfs4_update_changeattr_locked(), and what happens when !cinfo->atomic should tell you all you need to know.
On Wed, 07 Sep 2022, Jan Kara wrote: > On Wed 07-09-22 09:12:34, Jeff Layton wrote: > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > > > > +incremented before the data is copied into the pagecache. Therefore it is > > > > > > +possible to see a new i_version value while a read still shows the old data. > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an older > > > > sample anyway. If you do "statx; read; statx" and the value hasn't > > > > changed, then you know that things are stable. > > > > > > I don't see how that helps. It's still possible to get: > > > > > > reader writer > > > ------ ------ > > > i_version++ > > > statx > > > read > > > statx > > > update page cache > > > > > > right? > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > that case, maybe this is useless then other than for testing purposes > > and userland NFS servers. > > > > Would it be better to not consume a statx field with this if so? What > > could we use as an alternate interface? ioctl? Some sort of global > > virtual xattr? It does need to be something per-inode. > > I was thinking how hard would it be to increment i_version after updating > data but it will be rather hairy. In particular because of stuff like > IOCB_NOWAIT support which needs to bail if i_version update is needed. So > yeah, I don't think there's an easy way how to provide useful i_version for > general purpose use. > Why cannot IOCB_NOWAIT update i_version? Do we not want to wait on the cmp_xchg loop in inode_maybe_inc_iversion(), or do we not want to trigger an inode update? The first seems unlikely, but the second seems unreasonable. We already acknowledge that after a crash iversion might go backwards and/or miss changes. Thanks, NeilBrown
On Thu, 08 Sep 2022, Trond Myklebust wrote: > On Thu, 2022-09-08 at 10:31 +1000, NeilBrown wrote: > > On Wed, 07 Sep 2022, Trond Myklebust wrote: > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic > > > > > > > > with > > > > > > > > respect to the > > > > > > > > +other changes in the inode. On a write, for instance, > > > > > > > > the > > > > > > > > i_version it usually > > > > > > > > +incremented before the data is copied into the > > > > > > > > pagecache. > > > > > > > > Therefore it is > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > shows the old data. > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing > > > > > > to an > > > > > > older > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > hasn't > > > > > > changed, then you know that things are stable. > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > reader writer > > > > > ------ ------ > > > > > i_version++ > > > > > statx > > > > > read > > > > > statx > > > > > update page cache > > > > > > > > > > right? > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. > > > > In > > > > that case, maybe this is useless then other than for testing > > > > purposes > > > > and userland NFS servers. > > > > > > > > Would it be better to not consume a statx field with this if so? > > > > What > > > > could we use as an alternate interface? ioctl? Some sort of > > > > global > > > > virtual xattr? It does need to be something per-inode. > > > > > > I don't see how a non-atomic change attribute is remotely useful > > > even > > > for NFS. > > > > > > The main problem is not so much the above (although NFS clients are > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > If the server can't guarantee that file/directory/... creation and > > > unlink are atomically recorded with change attribute updates, then > > > the > > > client has to always assume that the server is lying, and that it > > > has > > > to revalidate all its caches anyway. Cue endless > > > readdir/lookup/getattr > > > requests after each and every directory modification in order to > > > check > > > that some other client didn't also sneak in a change of their own. > > > > NFS re-export doesn't support atomic change attributes on > > directories. > > Do we see the endless revalidate requests after directory > > modification > > in that situation? Just curious. > > Why wouldn't NFS re-export be capable of supporting atomic change > attributes in those cases, provided that the server does? It seems to > me that is just a question of providing the correct information w.r.t. > atomicity to knfsd. I don't know if it "could" but as far as I can see the Linux nfsd server doesn't. NFS sets EXPORT_OP_NOATOMIC_ATTR which causes ->fs_no_atomic_attr to be set so cinfo->atomic reported back to the client is always false. 
> > ...but yes, a quick glance at nfs4_update_changeattr_locked(), and what > happens when !cinfo->atomic should tell you all you need to know. Yep, I can see that all the directory cache is invalidated. I was more wondering if anyone had noticed this causing performance problems. I suspect there are some workloads where it isn't noticeable, and others where it would be quite unpleasant. Chuck said recently: > My impression is that pre/post attributes in NFSv3 have not > turned out to be as useful as their inventors predicted. https://lore.kernel.org/linux-nfs/8F16D957-F43A-4E5B-AA28-AAFCF43222E2@oracle.com/ I wonder how accurate that impression is. Thanks, NeilBrown > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com > > >
On Thu 08-09-22 10:44:22, NeilBrown wrote: > On Wed, 07 Sep 2022, Jan Kara wrote: > > On Wed 07-09-22 09:12:34, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the > > > > > > > +other changes in the inode. On a write, for instance, the i_version it usually > > > > > > > +incremented before the data is copied into the pagecache. Therefore it is > > > > > > > +possible to see a new i_version value while a read still shows the old data. > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an older > > > > > sample anyway. If you do "statx; read; statx" and the value hasn't > > > > > changed, then you know that things are stable. > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > reader writer > > > > ------ ------ > > > > i_version++ > > > > statx > > > > read > > > > statx > > > > update page cache > > > > > > > > right? > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > > that case, maybe this is useless then other than for testing purposes > > > and userland NFS servers. > > > > > > Would it be better to not consume a statx field with this if so? What > > > could we use as an alternate interface? ioctl? Some sort of global > > > virtual xattr? It does need to be something per-inode. > > > > I was thinking how hard would it be to increment i_version after updating > > data but it will be rather hairy. In particular because of stuff like > > IOCB_NOWAIT support which needs to bail if i_version update is needed. So > > yeah, I don't think there's an easy way how to provide useful i_version for > > general purpose use. > > > > Why cannot IOCB_NOWAIT update i_version? Do we not want to wait on the > cmp_xchg loop in inode_maybe_inc_iversion(), or do we not want to > trigger an inode update? > > The first seems unlikely, but the second seems unreasonable. We already > acknowledge that after a crash iversion might go backwards and/or miss > changes. It boils down to the fact that we don't want to call mark_inode_dirty() from IOCB_NOWAIT path because for lots of filesystems that means journal operation and there are high chances that may block. Presumably we could treat inode dirtying after i_version change similarly to how we handle timestamp updates with lazytime mount option (i.e., not dirty the inode immediately but only with a delay) but then the time window for i_version inconsistencies due to a crash would be much larger. Honza
On Thu, 2022-09-08 at 10:40 +1000, NeilBrown wrote: > On Thu, 08 Sep 2022, Jeff Layton wrote: > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > > > respect to the > > > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > > > i_version it usually > > > > > > > > +incremented before the data is copied into the pagecache. > > > > > > > > Therefore it is > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > shows the old data. > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > > > older > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > hasn't > > > > > > changed, then you know that things are stable. > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > reader writer > > > > > ------ ------ > > > > > i_version++ > > > > > statx > > > > > read > > > > > statx > > > > > update page cache > > > > > > > > > > right? > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > > > that case, maybe this is useless then other than for testing purposes > > > > and userland NFS servers. > > > > > > > > Would it be better to not consume a statx field with this if so? What > > > > could we use as an alternate interface? ioctl? Some sort of global > > > > virtual xattr? It does need to be something per-inode. > > > > > > I don't see how a non-atomic change attribute is remotely useful even > > > for NFS. > > > > > > The main problem is not so much the above (although NFS clients are > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > If the server can't guarantee that file/directory/... creation and > > > unlink are atomically recorded with change attribute updates, then the > > > client has to always assume that the server is lying, and that it has > > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > > > requests after each and every directory modification in order to check > > > that some other client didn't also sneak in a change of their own. > > > > > > > We generally hold the parent dir's inode->i_rwsem exclusively over most > > important directory changes, and the times/i_version are also updated > > while holding it. What we don't do is serialize reads of this value vs. > > the i_rwsem, so you could see new directory contents alongside an old > > i_version. Maybe we should be taking it for read when we query it on a > > directory? > > We do hold i_rwsem today. I'm working on changing that. Preserving > atomic directory changeinfo will be a challenge. The only mechanism I > can think if is to pass a "u64*" to all the directory modification ops, > and they fill in the version number at the point where it is incremented > (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that > "before" was one less than "after". 
If you don't want to internally > require single increments, then you would need to pass a 'u64 [2]' to > get two iversions back. > That's a major redesign of what the i_version counter is today. It may very well end up being needed, but that's going to touch a lot of stuff in the VFS. Are you planning to do that as a part of your locking changes? > > > > Achieving atomicity with file writes though is another matter entirely. > > I'm not sure that's even doable or how to approach it if so. > > Suggestions? > > Call inode_maybe_inc_version(page->host) in __folio_mark_dirty() ?? > Writes can cover multiple folios so we'd be doing several increments per write. Maybe that's ok? Should we also be updating the ctime at that point as well? Fetching the i_version under the i_rwsem is probably sufficient to fix this though. Most of the write_iter ops already bump the i_version while holding that lock, so this wouldn't add any extra locking to the write codepaths.
On Thu, 2022-09-08 at 00:41 +0000, Trond Myklebust wrote: > On Thu, 2022-09-08 at 10:31 +1000, NeilBrown wrote: > > On Wed, 07 Sep 2022, Trond Myklebust wrote: > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic > > > > > > > > with > > > > > > > > respect to the > > > > > > > > +other changes in the inode. On a write, for instance, > > > > > > > > the > > > > > > > > i_version it usually > > > > > > > > +incremented before the data is copied into the > > > > > > > > pagecache. > > > > > > > > Therefore it is > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > shows the old data. > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing > > > > > > to an > > > > > > older > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > hasn't > > > > > > changed, then you know that things are stable. > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > reader writer > > > > > ------ ------ > > > > > i_version++ > > > > > statx > > > > > read > > > > > statx > > > > > update page cache > > > > > > > > > > right? > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. > > > > In > > > > that case, maybe this is useless then other than for testing > > > > purposes > > > > and userland NFS servers. > > > > > > > > Would it be better to not consume a statx field with this if so? > > > > What > > > > could we use as an alternate interface? ioctl? Some sort of > > > > global > > > > virtual xattr? It does need to be something per-inode. > > > > > > I don't see how a non-atomic change attribute is remotely useful > > > even > > > for NFS. > > > > > > The main problem is not so much the above (although NFS clients are > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > If the server can't guarantee that file/directory/... creation and > > > unlink are atomically recorded with change attribute updates, then > > > the > > > client has to always assume that the server is lying, and that it > > > has > > > to revalidate all its caches anyway. Cue endless > > > readdir/lookup/getattr > > > requests after each and every directory modification in order to > > > check > > > that some other client didn't also sneak in a change of their own. > > > > NFS re-export doesn't support atomic change attributes on > > directories. > > Do we see the endless revalidate requests after directory > > modification > > in that situation? Just curious. > > Why wouldn't NFS re-export be capable of supporting atomic change > attributes in those cases, provided that the server does? It seems to > me that is just a question of providing the correct information w.r.t. > atomicity to knfsd. > > ...but yes, a quick glance at nfs4_update_changeattr_locked(), and what > happens when !cinfo->atomic should tell you all you need to know. The main reason we disabled atomic change attribute updates was that getattr calls on NFS can be pretty expensive. 
By setting the NOWCC flag, we can avoid those for WCC info, but at the expense of the client having to do more revalidation on its own.
On Thu, 2022-09-08 at 07:37 -0400, Jeff Layton wrote: > On Thu, 2022-09-08 at 00:41 +0000, Trond Myklebust wrote: > > On Thu, 2022-09-08 at 10:31 +1000, NeilBrown wrote: > > > On Wed, 07 Sep 2022, Trond Myklebust wrote: > > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton > > > > > > wrote: > > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not > > > > > > > > > atomic > > > > > > > > > with > > > > > > > > > respect to the > > > > > > > > > +other changes in the inode. On a write, for > > > > > > > > > instance, > > > > > > > > > the > > > > > > > > > i_version it usually > > > > > > > > > +incremented before the data is copied into the > > > > > > > > > pagecache. > > > > > > > > > Therefore it is > > > > > > > > > +possible to see a new i_version value while a read > > > > > > > > > still > > > > > > > > > shows the old data. > > > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for > > > > > > > comparing > > > > > > > to an > > > > > > > older > > > > > > > sample anyway. If you do "statx; read; statx" and the > > > > > > > value > > > > > > > hasn't > > > > > > > changed, then you know that things are stable. > > > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > > > reader writer > > > > > > ------ ------ > > > > > > i_version++ > > > > > > statx > > > > > > read > > > > > > statx > > > > > > update page cache > > > > > > > > > > > > right? > > > > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any > > > > > locking. > > > > > In > > > > > that case, maybe this is useless then other than for testing > > > > > purposes > > > > > and userland NFS servers. > > > > > > > > > > Would it be better to not consume a statx field with this if > > > > > so? > > > > > What > > > > > could we use as an alternate interface? ioctl? Some sort of > > > > > global > > > > > virtual xattr? It does need to be something per-inode. > > > > > > > > I don't see how a non-atomic change attribute is remotely > > > > useful > > > > even > > > > for NFS. > > > > > > > > The main problem is not so much the above (although NFS clients > > > > are > > > > vulnerable to that too) but the behaviour w.r.t. directory > > > > changes. > > > > > > > > If the server can't guarantee that file/directory/... creation > > > > and > > > > unlink are atomically recorded with change attribute updates, > > > > then > > > > the > > > > client has to always assume that the server is lying, and that > > > > it > > > > has > > > > to revalidate all its caches anyway. Cue endless > > > > readdir/lookup/getattr > > > > requests after each and every directory modification in order > > > > to > > > > check > > > > that some other client didn't also sneak in a change of their > > > > own. > > > > > > NFS re-export doesn't support atomic change attributes on > > > directories. > > > Do we see the endless revalidate requests after directory > > > modification > > > in that situation? Just curious. > > > > Why wouldn't NFS re-export be capable of supporting atomic change > > attributes in those cases, provided that the server does? 
It seems > > to > > me that is just a question of providing the correct information > > w.r.t. > > atomicity to knfsd. > > > > ...but yes, a quick glance at nfs4_update_changeattr_locked(), and > > what > > happens when !cinfo->atomic should tell you all you need to know. > > The main reason we disabled atomic change attribute updates was that > getattr calls on NFS can be pretty expensive. By setting the NOWCC > flag, > we can avoid those for WCC info, but at the expense of the client > having > to do more revalidation on its own. While providing WCC attributes on regular files is typically expensive, since it may involve needing to flush out I/O, doing so for directories tends to be a lot less so. The main reason is that all directory operations are synchronous in NFS, and typically do return at least the change attribute when they are modifying the directory contents. So yes, when we re-export NFS as NFSv3, we do want to skip returning WCC attributes for the file. However we usually do our best to return post-op attributes for the directory. Atomicity is a different matter though. Right now the NFS client does set EXPORT_OP_NOATOMIC_ATTR, but we could find ways to work around that for the NFSv4 change attribute at least, if we wanted to.
On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > It boils down to the fact that we don't want to call mark_inode_dirty() > from IOCB_NOWAIT path because for lots of filesystems that means journal > operation and there are high chances that may block. > > Presumably we could treat inode dirtying after i_version change similarly > to how we handle timestamp updates with lazytime mount option (i.e., not > dirty the inode immediately but only with a delay) but then the time window > for i_version inconsistencies due to a crash would be much larger. Perhaps this is a radical suggestion, but a lot of the problems seem to be due to the concern "what if the file system crashes" (and so we need to worry about making sure that any increment to i_version MUST be persisted after it is made). Well, if we assume that unclean shutdowns are rare, then perhaps we shouldn't be optimizing for that case. So.... what if a file system had a counter which got incremented each time its journal is replayed, representing an unclean shutdown? That shouldn't happen often, but if it does, there might be any number of i_version updates that may have gotten lost. So in that case, the NFS client should invalidate all of its caches. If the i_version field was large enough, we could just use the "unclean shutdown counter" as a prefix to the existing i_version number when it is sent over the NFS protocol to the client. But if that field is too small, and if (as I understand things) NFS just needs to know when i_version is different, we could simply hash the "unclean shutdown counter" with the inode's "i_version counter", and let that be the version which is sent from the server to the NFS client. If we could do that, then it doesn't become critical that every single i_version bump be persisted to disk, and we could treat it like a lazytime update; it's guaranteed to be updated when we do a clean unmount of the file system (and when the file system is frozen), but on a crash, there is no guarantee that all i_version bumps will be persisted. We do, however, have this "unclean shutdown" counter to deal with that case. Would this make life easier for folks? - Ted
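A toy model of that scheme, with invented names and an arbitrary shift-and-xor as the mixing step (any combination that changes when either input changes would do):

#include <stdint.h>

struct toy_sb {
	uint64_t crash_counter;	/* bumped each time the journal is replayed */
};

struct toy_inode {
	struct toy_sb *sb;
	uint64_t i_version;	/* on-disk counter, which may lag after a crash */
};

/* The value advertised to NFS clients changes whenever either the inode's
 * own counter or the filesystem's unclean-shutdown counter changes. */
static uint64_t advertised_change_attr(const struct toy_inode *inode)
{
	return (inode->sb->crash_counter << 48) ^ inode->i_version;
}

Because the crash counter is global to the filesystem, every inode's advertised value changes after an unclean shutdown, which is the cost raised below.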
On Thu, Sep 08, 2022 at 11:21:49AM -0400, Theodore Ts'o wrote: > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > It boils down to the fact that we don't want to call mark_inode_dirty() > > from IOCB_NOWAIT path because for lots of filesystems that means journal > > operation and there are high chances that may block. > > > > Presumably we could treat inode dirtying after i_version change similarly > > to how we handle timestamp updates with lazytime mount option (i.e., not > > dirty the inode immediately but only with a delay) but then the time window > > for i_version inconsistencies due to a crash would be much larger. > > Perhaps this is a radical suggestion, but there seems to be a lot of > the problems which are due to the concern "what if the file system > crashes" (and so we need to worry about making sure that any > increments to i_version MUST be persisted after it is incremented). > > Well, if we assume that unclean shutdowns are rare, then perhaps we > shouldn't be optimizing for that case. So.... what if a file system > had a counter which got incremented each time its journal is replayed > representing an unclean shutdown. That shouldn't happen often, but if > it does, there might be any number of i_version updates that may have > gotten lost. So in that case, the NFS client should invalidate all of > its caches. > > If the i_version field was large enough, we could just prefix the > "unclean shutdown counter" with the existing i_version number when it > is sent over the NFS protocol to the client. But if that field is too > small, The NFSv4 change attribute is 64 bits. Not sure exactly how to use that, but I think it should be large enough. > and if (as I understand things) NFS just needs to know when > i_version is different, we could just simply hash the "unclean > shtudown counter" with the inode's "i_version counter", and let that > be the version which is sent from the NFS client to the server. Yes, invalidating all caches could be painful, but as you say, also rare. We could also consider factoring the "unclean shutdown counter" on creating (and writing) the new value, instead of on returning it. That would mean it could go backward after a reboot, but at least it would never repeat a previously used value. (Since the new "unclean shutdown counter" will be factored in on first modification of the file after restart.) > If we could do that, then it doesn't become critical that every single > i_version bump has to be persisted to disk, and we could treat it like > a lazytime update; it's guaranteed to updated when we do an clean > unmount of the file system (and when the file system is frozen), but > on a crash, there is no guaranteee that all i_version bumps will be > persisted, but we do have this "unclean shutdown" counter to deal with > that case. > > Would this make life easier for folks? Anyway, yes, seems helpful to me, and not too complicated. (I say, having no idea how to implement the filesystem side.) --b.
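One possible toy rendering of that write-side variant, assuming a hypothetical per-inode record of the crash counter in effect at its last change; only inodes modified after the unclean shutdown pick up the new counter, so untouched inodes keep their old change attribute and their caches stay valid. The shift is an arbitrary illustrative choice, not a proposal for the real on-disk format.

#include <stdint.h>

struct toy_sb {
	uint64_t crash_counter;		/* bumped on each journal replay */
};

struct toy_inode {
	struct toy_sb *sb;
	uint64_t i_version;
	uint64_t crash_seen;		/* crash_counter at the last change */
};

static void bump_on_modify(struct toy_inode *inode)
{
	uint64_t c = inode->sb->crash_counter;

	if (inode->crash_seen != c) {
		/* First change after an unclean shutdown: restamp the value so
		 * that (in this toy, while the low 48 bits don't overflow) it
		 * can never repeat a value handed out before the crash. */
		inode->crash_seen = c;
		inode->i_version = c << 48;
	}
	inode->i_version++;
}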
On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > It boils down to the fact that we don't want to call mark_inode_dirty() > > from IOCB_NOWAIT path because for lots of filesystems that means journal > > operation and there are high chances that may block. > > > > Presumably we could treat inode dirtying after i_version change similarly > > to how we handle timestamp updates with lazytime mount option (i.e., not > > dirty the inode immediately but only with a delay) but then the time window > > for i_version inconsistencies due to a crash would be much larger. > > Perhaps this is a radical suggestion, but there seems to be a lot of > the problems which are due to the concern "what if the file system > crashes" (and so we need to worry about making sure that any > increments to i_version MUST be persisted after it is incremented). > > Well, if we assume that unclean shutdowns are rare, then perhaps we > shouldn't be optimizing for that case. So.... what if a file system > had a counter which got incremented each time its journal is replayed > representing an unclean shutdown. That shouldn't happen often, but if > it does, there might be any number of i_version updates that may have > gotten lost. So in that case, the NFS client should invalidate all of > its caches. > > If the i_version field was large enough, we could just prefix the > "unclean shutdown counter" with the existing i_version number when it > is sent over the NFS protocol to the client. But if that field is too > small, and if (as I understand things) NFS just needs to know when > i_version is different, we could just simply hash the "unclean > shtudown counter" with the inode's "i_version counter", and let that > be the version which is sent from the NFS client to the server. > > If we could do that, then it doesn't become critical that every single > i_version bump has to be persisted to disk, and we could treat it like > a lazytime update; it's guaranteed to updated when we do an clean > unmount of the file system (and when the file system is frozen), but > on a crash, there is no guaranteee that all i_version bumps will be > persisted, but we do have this "unclean shutdown" counter to deal with > that case. > > Would this make life easier for folks? > > - Ted Thanks for chiming in, Ted. That's part of the problem, but we're actually not too worried about that case: nfsd mixes the ctime in with i_version, so you'd have to crash+clock jump backward by juuuust enough to allow you to get the i_version and ctime into a state it was before the crash, but with different data. We're assuming that that is difficult to achieve in practice. The issue with a reboot counter (or similar) is that on an unclean crash the NFS client would end up invalidating every inode in the cache, as all of the i_versions would change. That's probably excessive. The bigger issue (at the moment) is atomicity: when we fetch an i_version, the natural inclination is to associate that with the state of the inode at some point in time, so we need this to be updated atomically with certain other attributes of the inode. That's the part I'm trying to sort through at the moment.
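For reference, a userspace model of that mixing, in the spirit of the change attribute nfsd constructs from the ctime and the counter; the exact arithmetic here is illustrative rather than copied from the kernel.

#include <stdint.h>
#include <time.h>

/* Combine ctime and i_version so that replaying an old counter value after
 * a crash only collides if the clock also lands on just the right spot. */
static uint64_t change_attr(const struct timespec *ctime, uint64_t i_version)
{
	uint64_t chattr = (uint64_t)ctime->tv_sec;

	chattr <<= 30;			/* leave room for the nanoseconds */
	chattr += (uint64_t)ctime->tv_nsec;
	return chattr + i_version;
}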
On Thu, Sep 08, 2022 at 11:44:33AM -0400, Jeff Layton wrote: > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > > It boils down to the fact that we don't want to call mark_inode_dirty() > > > from IOCB_NOWAIT path because for lots of filesystems that means journal > > > operation and there are high chances that may block. > > > > > > Presumably we could treat inode dirtying after i_version change similarly > > > to how we handle timestamp updates with lazytime mount option (i.e., not > > > dirty the inode immediately but only with a delay) but then the time window > > > for i_version inconsistencies due to a crash would be much larger. > > > > Perhaps this is a radical suggestion, but there seems to be a lot of > > the problems which are due to the concern "what if the file system > > crashes" (and so we need to worry about making sure that any > > increments to i_version MUST be persisted after it is incremented). > > > > Well, if we assume that unclean shutdowns are rare, then perhaps we > > shouldn't be optimizing for that case. So.... what if a file system > > had a counter which got incremented each time its journal is replayed > > representing an unclean shutdown. That shouldn't happen often, but if > > it does, there might be any number of i_version updates that may have > > gotten lost. So in that case, the NFS client should invalidate all of > > its caches. > > > > If the i_version field was large enough, we could just prefix the > > "unclean shutdown counter" with the existing i_version number when it > > is sent over the NFS protocol to the client. But if that field is too > > small, and if (as I understand things) NFS just needs to know when > > i_version is different, we could just simply hash the "unclean > > shtudown counter" with the inode's "i_version counter", and let that > > be the version which is sent from the NFS client to the server. > > > > If we could do that, then it doesn't become critical that every single > > i_version bump has to be persisted to disk, and we could treat it like > > a lazytime update; it's guaranteed to updated when we do an clean > > unmount of the file system (and when the file system is frozen), but > > on a crash, there is no guaranteee that all i_version bumps will be > > persisted, but we do have this "unclean shutdown" counter to deal with > > that case. > > > > Would this make life easier for folks? > > > > - Ted > > Thanks for chiming in, Ted. That's part of the problem, but we're > actually not too worried about that case: > > nfsd mixes the ctime in with i_version, so you'd have to crash+clock > jump backward by juuuust enough to allow you to get the i_version and > ctime into a state it was before the crash, but with different data. > We're assuming that that is difficult to achieve in practice. But a change in the clock could still cause our returned change attribute to go backwards (even without a crash). Not sure how to evaluate the risk, but it was enough that Trond hasn't been comfortable with nfsd advertising NFS4_CHANGE_TYPE_IS_MONOTONIC. Ted's idea would be sufficient to allow us to turn that flag on, which I think allows some client-side optimizations. > The issue with a reboot counter (or similar) is that on an unclean crash > the NFS client would end up invalidating every inode in the cache, as > all of the i_versions would change. That's probably excessive. 
But if we use the crash counter on write instead of read, we don't invalidate caches unnecessarily. And I think the monotonicity would still be close enough for our purposes? > The bigger issue (at the moment) is atomicity: when we fetch an > i_version, the natural inclination is to associate that with the state > of the inode at some point in time, so we need this to be updated > atomically with certain other attributes of the inode. That's the part > I'm trying to sort through at the moment. That may be, but I still suspect the crash counter would help. --b.
> On Sep 8, 2022, at 11:56 AM, J. Bruce Fields <bfields@fieldses.org> wrote: > > On Thu, Sep 08, 2022 at 11:44:33AM -0400, Jeff Layton wrote: >> On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: >>> On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: >>>> It boils down to the fact that we don't want to call mark_inode_dirty() >>>> from IOCB_NOWAIT path because for lots of filesystems that means journal >>>> operation and there are high chances that may block. >>>> >>>> Presumably we could treat inode dirtying after i_version change similarly >>>> to how we handle timestamp updates with lazytime mount option (i.e., not >>>> dirty the inode immediately but only with a delay) but then the time window >>>> for i_version inconsistencies due to a crash would be much larger. >>> >>> Perhaps this is a radical suggestion, but there seems to be a lot of >>> the problems which are due to the concern "what if the file system >>> crashes" (and so we need to worry about making sure that any >>> increments to i_version MUST be persisted after it is incremented). >>> >>> Well, if we assume that unclean shutdowns are rare, then perhaps we >>> shouldn't be optimizing for that case. So.... what if a file system >>> had a counter which got incremented each time its journal is replayed >>> representing an unclean shutdown. That shouldn't happen often, but if >>> it does, there might be any number of i_version updates that may have >>> gotten lost. So in that case, the NFS client should invalidate all of >>> its caches. >>> >>> If the i_version field was large enough, we could just prefix the >>> "unclean shutdown counter" with the existing i_version number when it >>> is sent over the NFS protocol to the client. But if that field is too >>> small, and if (as I understand things) NFS just needs to know when >>> i_version is different, we could just simply hash the "unclean >>> shtudown counter" with the inode's "i_version counter", and let that >>> be the version which is sent from the NFS client to the server. >>> >>> If we could do that, then it doesn't become critical that every single >>> i_version bump has to be persisted to disk, and we could treat it like >>> a lazytime update; it's guaranteed to updated when we do an clean >>> unmount of the file system (and when the file system is frozen), but >>> on a crash, there is no guaranteee that all i_version bumps will be >>> persisted, but we do have this "unclean shutdown" counter to deal with >>> that case. >>> >>> Would this make life easier for folks? >>> >>> - Ted >> >> Thanks for chiming in, Ted. That's part of the problem, but we're >> actually not too worried about that case: >> >> nfsd mixes the ctime in with i_version, so you'd have to crash+clock >> jump backward by juuuust enough to allow you to get the i_version and >> ctime into a state it was before the crash, but with different data. >> We're assuming that that is difficult to achieve in practice. > > But a change in the clock could still cause our returned change > attribute to go backwards (even without a crash). Not sure how to > evaluate the risk, but it was enough that Trond hasn't been comfortable > with nfsd advertising NFS4_CHANGE_TYPE_IS_MONOTONIC. > > Ted's idea would be sufficient to allow us to turn that flag on, which I > think allows some client-side optimizations. > >> The issue with a reboot counter (or similar) is that on an unclean crash >> the NFS client would end up invalidating every inode in the cache, as >> all of the i_versions would change. 
That's probably excessive. > > But if we use the crash counter on write instead of read, we don't > invalidate caches unnecessarily. And I think the monotonicity would > still be close enough for our purposes? > >> The bigger issue (at the moment) is atomicity: when we fetch an >> i_version, the natural inclination is to associate that with the state >> of the inode at some point in time, so we need this to be updated >> atomically with certain other attributes of the inode. That's the part >> I'm trying to sort through at the moment. > > That may be, but I still suspect the crash counter would help. Fwiw, I like the crash counter idea too. -- Chuck Lever
On Thu, 2022-09-08 at 11:56 -0400, J. Bruce Fields wrote: > On Thu, Sep 08, 2022 at 11:44:33AM -0400, Jeff Layton wrote: > > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > > > It boils down to the fact that we don't want to call mark_inode_dirty() > > > > from IOCB_NOWAIT path because for lots of filesystems that means journal > > > > operation and there are high chances that may block. > > > > > > > > Presumably we could treat inode dirtying after i_version change similarly > > > > to how we handle timestamp updates with lazytime mount option (i.e., not > > > > dirty the inode immediately but only with a delay) but then the time window > > > > for i_version inconsistencies due to a crash would be much larger. > > > > > > Perhaps this is a radical suggestion, but there seems to be a lot of > > > the problems which are due to the concern "what if the file system > > > crashes" (and so we need to worry about making sure that any > > > increments to i_version MUST be persisted after it is incremented). > > > > > > Well, if we assume that unclean shutdowns are rare, then perhaps we > > > shouldn't be optimizing for that case. So.... what if a file system > > > had a counter which got incremented each time its journal is replayed > > > representing an unclean shutdown. That shouldn't happen often, but if > > > it does, there might be any number of i_version updates that may have > > > gotten lost. So in that case, the NFS client should invalidate all of > > > its caches. > > > > > > If the i_version field was large enough, we could just prefix the > > > "unclean shutdown counter" with the existing i_version number when it > > > is sent over the NFS protocol to the client. But if that field is too > > > small, and if (as I understand things) NFS just needs to know when > > > i_version is different, we could just simply hash the "unclean > > > shtudown counter" with the inode's "i_version counter", and let that > > > be the version which is sent from the NFS client to the server. > > > > > > If we could do that, then it doesn't become critical that every single > > > i_version bump has to be persisted to disk, and we could treat it like > > > a lazytime update; it's guaranteed to updated when we do an clean > > > unmount of the file system (and when the file system is frozen), but > > > on a crash, there is no guaranteee that all i_version bumps will be > > > persisted, but we do have this "unclean shutdown" counter to deal with > > > that case. > > > > > > Would this make life easier for folks? > > > > > > - Ted > > > > Thanks for chiming in, Ted. That's part of the problem, but we're > > actually not too worried about that case: > > > > nfsd mixes the ctime in with i_version, so you'd have to crash+clock > > jump backward by juuuust enough to allow you to get the i_version and > > ctime into a state it was before the crash, but with different data. > > We're assuming that that is difficult to achieve in practice. > > But a change in the clock could still cause our returned change > attribute to go backwards (even without a crash). Not sure how to > evaluate the risk, but it was enough that Trond hasn't been comfortable > with nfsd advertising NFS4_CHANGE_TYPE_IS_MONOTONIC. > > Ted's idea would be sufficient to allow us to turn that flag on, which I > think allows some client-side optimizations. > Good point. 
> > The issue with a reboot counter (or similar) is that on an unclean crash > > the NFS client would end up invalidating every inode in the cache, as > > all of the i_versions would change. That's probably excessive. > > But if we use the crash counter on write instead of read, we don't > invalidate caches unnecessarily. And I think the monotonicity would > still be close enough for our purposes? > > > The bigger issue (at the moment) is atomicity: when we fetch an > > i_version, the natural inclination is to associate that with the state > > of the inode at some point in time, so we need this to be updated > > atomically with certain other attributes of the inode. That's the part > > I'm trying to sort through at the moment. > > That may be, but I still suspect the crash counter would help. > Yeah, ok. That does make some sense. So we would mix this into the i_version instead of the ctime when it was available. Preferably, we'd mix that in when we store the i_version rather than adding it afterward. Ted, how would we access this? Maybe we could just add a new (generic) super_block field for this that ext4 (and other filesystems) could populate at mount time?
On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > Yeah, ok. That does make some sense. So we would mix this into the > i_version instead of the ctime when it was available. Preferably, we'd > mix that in when we store the i_version rather than adding it afterward. > > Ted, how would we access this? Maybe we could just add a new (generic) > super_block field for this that ext4 (and other filesystems) could > populate at mount time? Couldn't the filesystem just return an ino_version that already includes it? --b.
On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > Yeah, ok. That does make some sense. So we would mix this into the > > i_version instead of the ctime when it was available. Preferably, we'd > > mix that in when we store the i_version rather than adding it afterward. > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > super_block field for this that ext4 (and other filesystems) could > > populate at mount time? > > Couldn't the filesystem just return an ino_version that already includes > it? > Yes. That's simple if we want to just fold it in during getattr. If we want to fold that into the values stored on disk, then I'm a little less clear on how that will work. Maybe I need a concrete example of how that will work: Suppose we have an i_version value X with the previous crash counter already factored in that makes it to disk. We hand out a newer version X+1 to a client, but that value never makes it to disk. The machine crashes and comes back up, and we get a query for i_version and it comes back as X. Fine, it's an old version. Now there is a write. What do we do to ensure that the new value doesn't collide with X+1?
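A worked toy example of that collision, and of how folding a crash counter into the first post-crash change sidesteps it (the 32-bit shift is an arbitrary illustrative choice):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t ondisk = 100;			/* X: the last value that reached disk */
	uint64_t handed_out = ondisk + 1;	/* X+1: seen by a client, then lost in the crash */

	/* A naive post-crash bump reproduces X+1 for different data. */
	assert(ondisk + 1 == handed_out);

	/* Folding a bumped crash counter into the next change avoids that. */
	uint64_t crash_counter = 1;
	uint64_t post_crash = (crash_counter << 32) + ondisk + 1;
	assert(post_crash != handed_out);
	return 0;
}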
On Thu, 08 Sep 2022, Jeff Layton wrote: > On Thu, 2022-09-08 at 10:40 +1000, NeilBrown wrote: > > On Thu, 08 Sep 2022, Jeff Layton wrote: > > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > > > > respect to the > > > > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > > > > i_version it usually > > > > > > > > > +incremented before the data is copied into the pagecache. > > > > > > > > > Therefore it is > > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > > shows the old data. > > > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > > > > older > > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > > hasn't > > > > > > > changed, then you know that things are stable. > > > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > > > reader writer > > > > > > ------ ------ > > > > > > i_version++ > > > > > > statx > > > > > > read > > > > > > statx > > > > > > update page cache > > > > > > > > > > > > right? > > > > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > > > > that case, maybe this is useless then other than for testing purposes > > > > > and userland NFS servers. > > > > > > > > > > Would it be better to not consume a statx field with this if so? What > > > > > could we use as an alternate interface? ioctl? Some sort of global > > > > > virtual xattr? It does need to be something per-inode. > > > > > > > > I don't see how a non-atomic change attribute is remotely useful even > > > > for NFS. > > > > > > > > The main problem is not so much the above (although NFS clients are > > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > > > If the server can't guarantee that file/directory/... creation and > > > > unlink are atomically recorded with change attribute updates, then the > > > > client has to always assume that the server is lying, and that it has > > > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > > > > requests after each and every directory modification in order to check > > > > that some other client didn't also sneak in a change of their own. > > > > > > > > > > We generally hold the parent dir's inode->i_rwsem exclusively over most > > > important directory changes, and the times/i_version are also updated > > > while holding it. What we don't do is serialize reads of this value vs. > > > the i_rwsem, so you could see new directory contents alongside an old > > > i_version. Maybe we should be taking it for read when we query it on a > > > directory? > > > > We do hold i_rwsem today. I'm working on changing that. Preserving > > atomic directory changeinfo will be a challenge. 
The only mechanism I > > can think if is to pass a "u64*" to all the directory modification ops, > > and they fill in the version number at the point where it is incremented > > (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that > > "before" was one less than "after". If you don't want to internally > > require single increments, then you would need to pass a 'u64 [2]' to > > get two iversions back. > > > > That's a major redesign of what the i_version counter is today. It may > very well end up being needed, but that's going to touch a lot of stuff > in the VFS. Are you planning to do that as a part of your locking > changes? > "A major design"? How? The "one less than" might be, but allowing a directory morphing op to fill in a "u64 [2]" is just a new interface to existing data. One that allows fine grained atomicity. This would actually be really good for NFS. nfs_mkdir (for example) could easily have access to the atomic pre/post changedid provided by the server, and so could easily provide them to nfsd. I'm not planning to do this as part of my locking changes. In the first instance only NFS changes behaviour, and it doesn't provide atomic changeids, so there is no loss of functionality. When some other filesystem wants to opt-in to shared-locking on directories - that would be the time to push through a better interface. > > > > > > Achieving atomicity with file writes though is another matter entirely. > > > I'm not sure that's even doable or how to approach it if so. > > > Suggestions? > > > > Call inode_maybe_inc_version(page->host) in __folio_mark_dirty() ?? > > > > Writes can cover multiple folios so we'd be doing several increments per > write. Maybe that's ok? Should we also be updating the ctime at that > point as well? You would only do several increments if something was reading the value concurrently, and then you really should to several increments for correctness. > > Fetching the i_version under the i_rwsem is probably sufficient to fix > this though. Most of the write_iter ops already bump the i_version while > holding that lock, so this wouldn't add any extra locking to the write > codepaths. Adding new locking doesn't seem like a good idea. It's bound to have performance implications. It may well end up serialising the directory op that I'm currently trying to make parallelisable. Thanks, NeilBrown > > -- > Jeff Layton <jlayton@kernel.org> >
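A simplified userspace model of the behaviour Neil is describing (a flag kept in the low bit of the counter, in the spirit of the kernel's iversion helpers; atomics and barriers omitted): a change only bumps the counter if somebody has queried it since the last bump, so multiple dirtied folios cause multiple increments only when readers are sampling the value concurrently.

#include <stdbool.h>
#include <stdint.h>

#define QUERIED_FLAG	1ULL	/* low bit: "this value has been handed out" */
#define VERSION_STEP	2ULL	/* the counter lives above the flag bit */

static uint64_t raw_version;	/* stands in for inode->i_version */

/* Called on a change, e.g. when a folio is dirtied. */
static bool maybe_inc_version(void)
{
	if (!(raw_version & QUERIED_FLAG))
		return false;	/* nobody has looked since the last bump: skip */
	raw_version = (raw_version + VERSION_STEP) & ~QUERIED_FLAG;
	return true;
}

/* Called when the value is handed out, e.g. to nfsd or statx. */
static uint64_t query_version(void)
{
	raw_version |= QUERIED_FLAG;	/* force the next change to bump */
	return raw_version >> 1;
}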
On Fri, 09 Sep 2022, Theodore Ts'o wrote: > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > It boils down to the fact that we don't want to call mark_inode_dirty() > > from IOCB_NOWAIT path because for lots of filesystems that means journal > > operation and there are high chances that may block. > > > > Presumably we could treat inode dirtying after i_version change similarly > > to how we handle timestamp updates with lazytime mount option (i.e., not > > dirty the inode immediately but only with a delay) but then the time window > > for i_version inconsistencies due to a crash would be much larger. > > Perhaps this is a radical suggestion, but there seems to be a lot of > the problems which are due to the concern "what if the file system > crashes" (and so we need to worry about making sure that any > increments to i_version MUST be persisted after it is incremented). > > Well, if we assume that unclean shutdowns are rare, then perhaps we > shouldn't be optimizing for that case. So.... what if a file system > had a counter which got incremented each time its journal is replayed > representing an unclean shutdown. That shouldn't happen often, but if > it does, there might be any number of i_version updates that may have > gotten lost. So in that case, the NFS client should invalidate all of > its caches. I was also thinking that the filesystem could help close that gap, but I didn't like the "whole filesysem is dirty" approach. I instead imagined a "dirty" bit in the on-disk inode which was set soon after any open-for-write and cleared when the inode was finally written after there are no active opens and no unflushed data. The "soon after" would set a maximum window on possible lost version updates (which people seem to have comfortable with) without imposing a sync IO operation on open (for first write). When loading an inode from disk, if the dirty flag was set then the difference between current time and on-disk ctime (in nanoseconds) could be added to the version number. But maybe that is too complex for the gain. NeilBrown
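A toy rendering of that per-inode variant, with invented field names; the point is only that the on-disk "may have lost some bumps" bit bounds how the version is perturbed when the inode is next loaded.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct ondisk_inode {
	uint64_t i_version;
	struct timespec ctime;
	bool write_dirty;	/* set soon after open-for-write, cleared once the
				 * inode is written back with no writers left */
};

static uint64_t load_version(const struct ondisk_inode *di)
{
	uint64_t v = di->i_version;

	if (di->write_dirty) {
		/* Bumps may have been lost in a crash: perturb the version by
		 * the elapsed wall-clock time since the on-disk ctime. */
		struct timespec now;
		clock_gettime(CLOCK_REALTIME, &now);
		v += (uint64_t)((int64_t)(now.tv_sec - di->ctime.tv_sec) * 1000000000LL
				+ (now.tv_nsec - di->ctime.tv_nsec));
	}
	return v;
}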
On Fri, 09 Sep 2022, Jeff Layton wrote: > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > > It boils down to the fact that we don't want to call mark_inode_dirty() > > > from IOCB_NOWAIT path because for lots of filesystems that means journal > > > operation and there are high chances that may block. > > > > > > Presumably we could treat inode dirtying after i_version change similarly > > > to how we handle timestamp updates with lazytime mount option (i.e., not > > > dirty the inode immediately but only with a delay) but then the time window > > > for i_version inconsistencies due to a crash would be much larger. > > > > Perhaps this is a radical suggestion, but there seems to be a lot of > > the problems which are due to the concern "what if the file system > > crashes" (and so we need to worry about making sure that any > > increments to i_version MUST be persisted after it is incremented). > > > > Well, if we assume that unclean shutdowns are rare, then perhaps we > > shouldn't be optimizing for that case. So.... what if a file system > > had a counter which got incremented each time its journal is replayed > > representing an unclean shutdown. That shouldn't happen often, but if > > it does, there might be any number of i_version updates that may have > > gotten lost. So in that case, the NFS client should invalidate all of > > its caches. > > > > If the i_version field was large enough, we could just prefix the > > "unclean shutdown counter" with the existing i_version number when it > > is sent over the NFS protocol to the client. But if that field is too > > small, and if (as I understand things) NFS just needs to know when > > i_version is different, we could just simply hash the "unclean > > shtudown counter" with the inode's "i_version counter", and let that > > be the version which is sent from the NFS client to the server. > > > > If we could do that, then it doesn't become critical that every single > > i_version bump has to be persisted to disk, and we could treat it like > > a lazytime update; it's guaranteed to updated when we do an clean > > unmount of the file system (and when the file system is frozen), but > > on a crash, there is no guaranteee that all i_version bumps will be > > persisted, but we do have this "unclean shutdown" counter to deal with > > that case. > > > > Would this make life easier for folks? > > > > - Ted > > Thanks for chiming in, Ted. That's part of the problem, but we're > actually not too worried about that case: > > nfsd mixes the ctime in with i_version, so you'd have to crash+clock > jump backward by juuuust enough to allow you to get the i_version and > ctime into a state it was before the crash, but with different data. > We're assuming that that is difficult to achieve in practice. > > The issue with a reboot counter (or similar) is that on an unclean crash > the NFS client would end up invalidating every inode in the cache, as > all of the i_versions would change. That's probably excessive. > > The bigger issue (at the moment) is atomicity: when we fetch an > i_version, the natural inclination is to associate that with the state > of the inode at some point in time, so we need this to be updated > atomically with certain other attributes of the inode. That's the part > I'm trying to sort through at the moment. I don't think atomicity matters nearly as much as ordering. The i_version must not be visible before the change that it reflects. It is OK for it to be after. 
Even seconds after without great cost. It is bad for it to be earlier. Any unlocked gap after the i_version update and before the change is visible can result in a race and incorrect caching. Even for directory updates where NFSv4 wants atomic before/after version numbers, they don't need to be atomic w.r.t. the change being visible. If three concurrent file creates cause the version number to go from 4 to 7, then it is important that one op sees "4,5", one sees "5,6" and one sees "6,7", but it doesn't matter if concurrent lookups only see version 4 even while they can see the newly created names. A longer gap increases the risk of an unnecessary cache flush, but it doesn't lead to incorrectness. So I think we should put the version update *after* the change is visible, and not require locking (beyond a memory barrier) when reading the version. It should be as soon after as practical, but no sooner. NeilBrown
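A userspace sketch of that ordering, using C11 atomics where the kernel would use barriers: the data change is made visible first and the version bump is published afterwards with release semantics, so a reader that observes the bumped version (acquire) is guaranteed to also observe the change it reflects; the version lagging the data only risks an unnecessary invalidation.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t version;	/* stands in for i_version */
static _Atomic uint64_t data;		/* stands in for file or directory state */

static void change_then_publish(uint64_t new_data)
{
	atomic_store_explicit(&data, new_data, memory_order_relaxed);
	/* The data store above cannot become visible after this bump. */
	atomic_fetch_add_explicit(&version, 1, memory_order_release);
}

static uint64_t sample(uint64_t *data_out)
{
	/* Acquire pairs with the writer's release: seeing version N implies
	 * seeing every change published at or before N. */
	uint64_t v = atomic_load_explicit(&version, memory_order_acquire);

	*data_out = atomic_load_explicit(&data, memory_order_relaxed);
	return v;
}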
On Fri, 09 Sep 2022, Jeff Layton wrote: > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > Yeah, ok. That does make some sense. So we would mix this into the > > > i_version instead of the ctime when it was available. Preferably, we'd > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > super_block field for this that ext4 (and other filesystems) could > > > populate at mount time? > > > > Couldn't the filesystem just return an ino_version that already includes > > it? > > > > Yes. That's simple if we want to just fold it in during getattr. If we > want to fold that into the values stored on disk, then I'm a little less > clear on how that will work. > > Maybe I need a concrete example of how that will work: > > Suppose we have an i_version value X with the previous crash counter > already factored in that makes it to disk. We hand out a newer version > X+1 to a client, but that value never makes it to disk. As I understand it, the crash counter would NEVER appear in the on-disk i_version. The crash counter is stable while a filesystem is mounted so is the same when loading an inode from disk and when writing it back. When loading, add crash counter to on-disk i_version to provide in-memory i_version. when storing, subtract crash counter from in-memory i_version to provide on-disk i_version. "add" and "subtract" could be any reversible hash, and its inverse. I would probably shift the crash counter up 16 and add/subtract. NeilBrown > > The machine crashes and comes back up, and we get a query for i_version > and it comes back as X. Fine, it's an old version. Now there is a write. > What do we do to ensure that the new value doesn't collide with X+1? > -- > Jeff Layton <jlayton@kernel.org> >
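Concretely, a minimal sketch of that transform using Neil's shift-by-16 suggestion; the crash counter itself never reaches the disk.

#include <stdint.h>

static uint64_t crash_counter;	/* per-filesystem, bumped on journal replay */

/* Applied when an inode is read in from disk. */
static uint64_t iversion_from_disk(uint64_t ondisk)
{
	return ondisk + (crash_counter << 16);
}

/* Applied when the inode is written back, undoing the transform. */
static uint64_t iversion_to_disk(uint64_t inmem)
{
	return inmem - (crash_counter << 16);
}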
On Fri, 2022-09-09 at 09:01 +1000, NeilBrown wrote: > On Fri, 09 Sep 2022, Jeff Layton wrote: > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > > Yeah, ok. That does make some sense. So we would mix this into the > > > > i_version instead of the ctime when it was available. Preferably, we'd > > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > > super_block field for this that ext4 (and other filesystems) could > > > > populate at mount time? > > > > > > Couldn't the filesystem just return an ino_version that already includes > > > it? > > > > > > > Yes. That's simple if we want to just fold it in during getattr. If we > > want to fold that into the values stored on disk, then I'm a little less > > clear on how that will work. > > > > Maybe I need a concrete example of how that will work: > > > > Suppose we have an i_version value X with the previous crash counter > > already factored in that makes it to disk. We hand out a newer version > > X+1 to a client, but that value never makes it to disk. > > As I understand it, the crash counter would NEVER appear in the on-disk > i_version. > The crash counter is stable while a filesystem is mounted so is the same > when loading an inode from disk and when writing it back. > > When loading, add crash counter to on-disk i_version to provide > in-memory i_version. > when storing, subtract crash counter from in-memory i_version to provide > on-disk i_version. > > "add" and "subtract" could be any reversible hash, and its inverse. I > would probably shift the crash counter up 16 and add/subtract. > > If you store the value with the crash counter already factored-in, then not every inode would end up being invalidated after a crash. If we try to mix it in later, the client will end up invalidating the cache even for inodes that had no changes. > > > > The machine crashes and comes back up, and we get a query for i_version > > and it comes back as X. Fine, it's an old version. Now there is a write. > > What do we do to ensure that the new value doesn't collide with X+1? > > -- > > Jeff Layton <jlayton@kernel.org> > >
On Fri, 09 Sep 2022, Jeff Layton wrote: > On Fri, 2022-09-09 at 09:01 +1000, NeilBrown wrote: > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > > > Yeah, ok. That does make some sense. So we would mix this into the > > > > > i_version instead of the ctime when it was available. Preferably, we'd > > > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > > > super_block field for this that ext4 (and other filesystems) could > > > > > populate at mount time? > > > > > > > > Couldn't the filesystem just return an ino_version that already includes > > > > it? > > > > > > > > > > Yes. That's simple if we want to just fold it in during getattr. If we > > > want to fold that into the values stored on disk, then I'm a little less > > > clear on how that will work. > > > > > > Maybe I need a concrete example of how that will work: > > > > > > Suppose we have an i_version value X with the previous crash counter > > > already factored in that makes it to disk. We hand out a newer version > > > X+1 to a client, but that value never makes it to disk. > > > > As I understand it, the crash counter would NEVER appear in the on-disk > > i_version. > > The crash counter is stable while a filesystem is mounted so is the same > > when loading an inode from disk and when writing it back. > > > > When loading, add crash counter to on-disk i_version to provide > > in-memory i_version. > > when storing, subtract crash counter from in-memory i_version to provide > > on-disk i_version. > > > > "add" and "subtract" could be any reversible hash, and its inverse. I > > would probably shift the crash counter up 16 and add/subtract. > > > > > > If you store the value with the crash counter already factored-in, then > not every inode would end up being invalidated after a crash. If we try > to mix it in later, the client will end up invalidating the cache even > for inodes that had no changes. How do we know which inodes need the crash counter merged in? I thought the whole point of the crash counter was that it affected every file (easy, safe, expensive, but hopefully rare enough that the expense could be justified). NeilBrown > > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > What do we do to ensure that the new value doesn't collide with X+1? > > > -- > > > Jeff Layton <jlayton@kernel.org> > > > > > -- > Jeff Layton <jlayton@kernel.org> >
On Fri, 2022-09-09 at 08:55 +1000, NeilBrown wrote: > On Fri, 09 Sep 2022, Jeff Layton wrote: > > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > > > It boils down to the fact that we don't want to call > > > > mark_inode_dirty() > > > > from IOCB_NOWAIT path because for lots of filesystems that > > > > means journal > > > > operation and there are high chances that may block. > > > > > > > > Presumably we could treat inode dirtying after i_version change > > > > similarly > > > > to how we handle timestamp updates with lazytime mount option > > > > (i.e., not > > > > dirty the inode immediately but only with a delay) but then the > > > > time window > > > > for i_version inconsistencies due to a crash would be much > > > > larger. > > > > > > Perhaps this is a radical suggestion, but there seems to be a lot > > > of > > > the problems which are due to the concern "what if the file > > > system > > > crashes" (and so we need to worry about making sure that any > > > increments to i_version MUST be persisted after it is > > > incremented). > > > > > > Well, if we assume that unclean shutdowns are rare, then perhaps > > > we > > > shouldn't be optimizing for that case. So.... what if a file > > > system > > > had a counter which got incremented each time its journal is > > > replayed > > > representing an unclean shutdown. That shouldn't happen often, > > > but if > > > it does, there might be any number of i_version updates that may > > > have > > > gotten lost. So in that case, the NFS client should invalidate > > > all of > > > its caches. > > > > > > If the i_version field was large enough, we could just prefix the > > > "unclean shutdown counter" with the existing i_version number > > > when it > > > is sent over the NFS protocol to the client. But if that field > > > is too > > > small, and if (as I understand things) NFS just needs to know > > > when > > > i_version is different, we could just simply hash the "unclean > > > shtudown counter" with the inode's "i_version counter", and let > > > that > > > be the version which is sent from the NFS client to the server. > > > > > > If we could do that, then it doesn't become critical that every > > > single > > > i_version bump has to be persisted to disk, and we could treat it > > > like > > > a lazytime update; it's guaranteed to updated when we do an clean > > > unmount of the file system (and when the file system is frozen), > > > but > > > on a crash, there is no guaranteee that all i_version bumps will > > > be > > > persisted, but we do have this "unclean shutdown" counter to deal > > > with > > > that case. > > > > > > Would this make life easier for folks? > > > > > > - Ted > > > > Thanks for chiming in, Ted. That's part of the problem, but we're > > actually not too worried about that case: > > > > nfsd mixes the ctime in with i_version, so you'd have to > > crash+clock > > jump backward by juuuust enough to allow you to get the i_version > > and > > ctime into a state it was before the crash, but with different > > data. > > We're assuming that that is difficult to achieve in practice. > > > > The issue with a reboot counter (or similar) is that on an unclean > > crash > > the NFS client would end up invalidating every inode in the cache, > > as > > all of the i_versions would change. That's probably excessive. 
> > > > The bigger issue (at the moment) is atomicity: when we fetch an > > i_version, the natural inclination is to associate that with the > > state > > of the inode at some point in time, so we need this to be updated > > atomically with certain other attributes of the inode. That's the > > part > > I'm trying to sort through at the moment. > > I don't think atomicity matters nearly as much as ordering. > > The i_version must not be visible before the change that it reflects. > It is OK for it to be after. Even seconds after without great cost. > It > is bad for it to be earlier. Any unlocked gap after the i_version > update and before the change is visible can result in a race and > incorrect caching. > > Even for directory updates where NFSv4 wants atomic before/after > version > numbers, they don't need to be atomic w.r.t. the change being > visible. > > If three concurrent file creates cause the version number to go from > 4 > to 7, then it is important that one op sees "4,5", one sees "5,6" and > one sees "6,7", but it doesn't matter if concurrent lookups only see > version 4 even while they can see the newly created names. > > A longer gap increases the risk of an unnecessary cache flush, but it > doesn't lead to incorrectness. > I'm not really sure what you mean when you say that a 'longer gap increases the risk of an unnecessary cache flush'. Either the change attribute update is atomic with the operation it is recording, or it is not. If that update is recorded in the NFS reply as not being atomic, then the client will evict all cached data that is associated with that change attribute at some point. > So I think we should put the version update *after* the change is > visible, and not require locking (beyond a memory barrier) when > reading > the version. It should be as soon after as practical, bit no sooner. > Ordering is not a sufficient condition. The guarantee needs to be that any application that reads the change attribute, then reads file data and then reads the change attribute again will see the 2 change attribute values as being the same *if and only if* there were no changes to the file data made after the read and before the read of the change attribute. That includes the case where data was written after the read, and a crash occurred after it was committed to stable storage. If you only update the version after the written data is visible, then there is a possibility that the crash could occur before any change attribute update is committed to disk. IOW: the minimal condition needs to be that for all cases below, the application reads 'state B' as having occurred if any data was committed to disk before the crash. Application Filesystem =========== ========== read change attr <- 'state A' read data <- 'state A' write data -> 'state B' <crash>+<reboot> read change attr <- 'state B'
On Fri, 09 Sep 2022, Trond Myklebust wrote: > On Fri, 2022-09-09 at 08:55 +1000, NeilBrown wrote: > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > > > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > > > > It boils down to the fact that we don't want to call > > > > > mark_inode_dirty() > > > > > from IOCB_NOWAIT path because for lots of filesystems that > > > > > means journal > > > > > operation and there are high chances that may block. > > > > > > > > > > Presumably we could treat inode dirtying after i_version change > > > > > similarly > > > > > to how we handle timestamp updates with lazytime mount option > > > > > (i.e., not > > > > > dirty the inode immediately but only with a delay) but then the > > > > > time window > > > > > for i_version inconsistencies due to a crash would be much > > > > > larger. > > > > > > > > Perhaps this is a radical suggestion, but there seems to be a lot > > > > of > > > > the problems which are due to the concern "what if the file > > > > system > > > > crashes" (and so we need to worry about making sure that any > > > > increments to i_version MUST be persisted after it is > > > > incremented). > > > > > > > > Well, if we assume that unclean shutdowns are rare, then perhaps > > > > we > > > > shouldn't be optimizing for that case. So.... what if a file > > > > system > > > > had a counter which got incremented each time its journal is > > > > replayed > > > > representing an unclean shutdown. That shouldn't happen often, > > > > but if > > > > it does, there might be any number of i_version updates that may > > > > have > > > > gotten lost. So in that case, the NFS client should invalidate > > > > all of > > > > its caches. > > > > > > > > If the i_version field was large enough, we could just prefix the > > > > "unclean shutdown counter" with the existing i_version number > > > > when it > > > > is sent over the NFS protocol to the client. But if that field > > > > is too > > > > small, and if (as I understand things) NFS just needs to know > > > > when > > > > i_version is different, we could just simply hash the "unclean > > > > shtudown counter" with the inode's "i_version counter", and let > > > > that > > > > be the version which is sent from the NFS client to the server. > > > > > > > > If we could do that, then it doesn't become critical that every > > > > single > > > > i_version bump has to be persisted to disk, and we could treat it > > > > like > > > > a lazytime update; it's guaranteed to updated when we do an clean > > > > unmount of the file system (and when the file system is frozen), > > > > but > > > > on a crash, there is no guaranteee that all i_version bumps will > > > > be > > > > persisted, but we do have this "unclean shutdown" counter to deal > > > > with > > > > that case. > > > > > > > > Would this make life easier for folks? > > > > > > > > - Ted > > > > > > Thanks for chiming in, Ted. That's part of the problem, but we're > > > actually not too worried about that case: > > > > > > nfsd mixes the ctime in with i_version, so you'd have to > > > crash+clock > > > jump backward by juuuust enough to allow you to get the i_version > > > and > > > ctime into a state it was before the crash, but with different > > > data. > > > We're assuming that that is difficult to achieve in practice. 
> > > > > > The issue with a reboot counter (or similar) is that on an unclean > > > crash > > > the NFS client would end up invalidating every inode in the cache, > > > as > > > all of the i_versions would change. That's probably excessive. > > > > > > The bigger issue (at the moment) is atomicity: when we fetch an > > > i_version, the natural inclination is to associate that with the > > > state > > > of the inode at some point in time, so we need this to be updated > > > atomically with certain other attributes of the inode. That's the > > > part > > > I'm trying to sort through at the moment. > > > > I don't think atomicity matters nearly as much as ordering. > > > > The i_version must not be visible before the change that it reflects. > > It is OK for it to be after. Even seconds after without great cost. > > It > > is bad for it to be earlier. Any unlocked gap after the i_version > > update and before the change is visible can result in a race and > > incorrect caching. > > > > Even for directory updates where NFSv4 wants atomic before/after > > version > > numbers, they don't need to be atomic w.r.t. the change being > > visible. > > > > If three concurrent file creates cause the version number to go from > > 4 > > to 7, then it is important that one op sees "4,5", one sees "5,6" and > > one sees "6,7", but it doesn't matter if concurrent lookups only see > > version 4 even while they can see the newly created names. > > > > A longer gap increases the risk of an unnecessary cache flush, but it > > doesn't lead to incorrectness. > > > > I'm not really sure what you mean when you say that a 'longer gap > increases the risk of an unnecessary cache flush'. Either the change > attribute update is atomic with the operation it is recording, or it is > not. If that update is recorded in the NFS reply as not being atomic, > then the client will evict all cached data that is associated with that > change attribute at some point. > > > So I think we should put the version update *after* the change is > > visible, and not require locking (beyond a memory barrier) when > > reading > > the version. It should be as soon after as practical, bit no sooner. > > > > Ordering is not a sufficient condition. The guarantee needs to be that > any application that reads the change attribute, then reads file data > and then reads the change attribute again will see the 2 change > attribute values as being the same *if and only if* there were no > changes to the file data made after the read and before the read of the > change attribute. I'm say that only the "only if" is mandatory - getting that wrong has a correctness cost. BUT the "if" is less critical. Getting that wrong has a performance cost. We want to get it wrong as rarely as possible, but there is a performance cost to the underlying filesystem in providing perfection, and that must be balanced with the performance cost to NFS of providing imperfect results. For NFSv4, this is of limited interest for files. If the client has a delegation, then it is certain that no other client or server-side application will change the file, so it doesn't need to pay much attention to change ids. If the client doesn't have a delegation, then if there is any change to the changeid, the client cannot be certain that the change wasn't due to some other client, so it must purge its cache on close or lock. So fine details of the changeid aren't interesting (as long as we have the "only if"). 
For directories, NFSv4 does want precise changeids, but directory ops needs to be sync for NFS anyway, so the extra burden on the fs is small. > That includes the case where data was written after the read, and a > crash occurred after it was committed to stable storage. If you only > update the version after the written data is visible, then there is a > possibility that the crash could occur before any change attribute > update is committed to disk. I think we all agree that handling a crash is hard. I think that should be a separate consideration to how i_version is handled during normal running. > > IOW: the minimal condition needs to be that for all cases below, the > application reads 'state B' as having occurred if any data was > committed to disk before the crash. > > Application Filesystem > =========== ========== > read change attr <- 'state A' > read data <- 'state A' > write data -> 'state B' > <crash>+<reboot> > read change attr <- 'state B' The important thing here is to not see 'state A'. Seeing 'state C' should be acceptable. Worst case we could merge in wall-clock time of system boot, but the filesystem should be able to be more helpful than that. NeilBrown > > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com > > >
On Fri, 2022-09-09 at 10:51 +1000, NeilBrown wrote: > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > On Fri, 2022-09-09 at 08:55 +1000, NeilBrown wrote: > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote: > > > > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote: > > > > > > It boils down to the fact that we don't want to call > > > > > > mark_inode_dirty() > > > > > > from IOCB_NOWAIT path because for lots of filesystems that > > > > > > means journal > > > > > > operation and there are high chances that may block. > > > > > > > > > > > > Presumably we could treat inode dirtying after i_version > > > > > > change > > > > > > similarly > > > > > > to how we handle timestamp updates with lazytime mount > > > > > > option > > > > > > (i.e., not > > > > > > dirty the inode immediately but only with a delay) but then > > > > > > the > > > > > > time window > > > > > > for i_version inconsistencies due to a crash would be much > > > > > > larger. > > > > > > > > > > Perhaps this is a radical suggestion, but there seems to be a > > > > > lot > > > > > of > > > > > the problems which are due to the concern "what if the file > > > > > system > > > > > crashes" (and so we need to worry about making sure that any > > > > > increments to i_version MUST be persisted after it is > > > > > incremented). > > > > > > > > > > Well, if we assume that unclean shutdowns are rare, then > > > > > perhaps > > > > > we > > > > > shouldn't be optimizing for that case. So.... what if a file > > > > > system > > > > > had a counter which got incremented each time its journal is > > > > > replayed > > > > > representing an unclean shutdown. That shouldn't happen > > > > > often, > > > > > but if > > > > > it does, there might be any number of i_version updates that > > > > > may > > > > > have > > > > > gotten lost. So in that case, the NFS client should > > > > > invalidate > > > > > all of > > > > > its caches. > > > > > > > > > > If the i_version field was large enough, we could just prefix > > > > > the > > > > > "unclean shutdown counter" with the existing i_version number > > > > > when it > > > > > is sent over the NFS protocol to the client. But if that > > > > > field > > > > > is too > > > > > small, and if (as I understand things) NFS just needs to know > > > > > when > > > > > i_version is different, we could just simply hash the > > > > > "unclean > > > > > shtudown counter" with the inode's "i_version counter", and > > > > > let > > > > > that > > > > > be the version which is sent from the NFS client to the > > > > > server. > > > > > > > > > > If we could do that, then it doesn't become critical that > > > > > every > > > > > single > > > > > i_version bump has to be persisted to disk, and we could > > > > > treat it > > > > > like > > > > > a lazytime update; it's guaranteed to updated when we do an > > > > > clean > > > > > unmount of the file system (and when the file system is > > > > > frozen), > > > > > but > > > > > on a crash, there is no guaranteee that all i_version bumps > > > > > will > > > > > be > > > > > persisted, but we do have this "unclean shutdown" counter to > > > > > deal > > > > > with > > > > > that case. > > > > > > > > > > Would this make life easier for folks? > > > > > > > > > > - Ted > > > > > > > > Thanks for chiming in, Ted. 
That's part of the problem, but > > > > we're > > > > actually not too worried about that case: > > > > > > > > nfsd mixes the ctime in with i_version, so you'd have to > > > > crash+clock > > > > jump backward by juuuust enough to allow you to get the > > > > i_version > > > > and > > > > ctime into a state it was before the crash, but with different > > > > data. > > > > We're assuming that that is difficult to achieve in practice. > > > > > > > > The issue with a reboot counter (or similar) is that on an > > > > unclean > > > > crash > > > > the NFS client would end up invalidating every inode in the > > > > cache, > > > > as > > > > all of the i_versions would change. That's probably excessive. > > > > > > > > The bigger issue (at the moment) is atomicity: when we fetch an > > > > i_version, the natural inclination is to associate that with > > > > the > > > > state > > > > of the inode at some point in time, so we need this to be > > > > updated > > > > atomically with certain other attributes of the inode. That's > > > > the > > > > part > > > > I'm trying to sort through at the moment. > > > > > > I don't think atomicity matters nearly as much as ordering. > > > > > > The i_version must not be visible before the change that it > > > reflects. > > > It is OK for it to be after. Even seconds after without great > > > cost. > > > It > > > is bad for it to be earlier. Any unlocked gap after the > > > i_version > > > update and before the change is visible can result in a race and > > > incorrect caching. > > > > > > Even for directory updates where NFSv4 wants atomic before/after > > > version > > > numbers, they don't need to be atomic w.r.t. the change being > > > visible. > > > > > > If three concurrent file creates cause the version number to go > > > from > > > 4 > > > to 7, then it is important that one op sees "4,5", one sees "5,6" > > > and > > > one sees "6,7", but it doesn't matter if concurrent lookups only > > > see > > > version 4 even while they can see the newly created names. > > > > > > A longer gap increases the risk of an unnecessary cache flush, > > > but it > > > doesn't lead to incorrectness. > > > > > > > I'm not really sure what you mean when you say that a 'longer gap > > increases the risk of an unnecessary cache flush'. Either the > > change > > attribute update is atomic with the operation it is recording, or > > it is > > not. If that update is recorded in the NFS reply as not being > > atomic, > > then the client will evict all cached data that is associated with > > that > > change attribute at some point. > > > > > So I think we should put the version update *after* the change is > > > visible, and not require locking (beyond a memory barrier) when > > > reading > > > the version. It should be as soon after as practical, bit no > > > sooner. > > > > > > > Ordering is not a sufficient condition. The guarantee needs to be > > that > > any application that reads the change attribute, then reads file > > data > > and then reads the change attribute again will see the 2 change > > attribute values as being the same *if and only if* there were no > > changes to the file data made after the read and before the read of > > the > > change attribute. > > I'm say that only the "only if" is mandatory - getting that wrong has > a > correctness cost. > BUT the "if" is less critical. Getting that wrong has a performance > cost. 
We want to get it wrong as rarely as possible, but there is a > performance cost to the underlying filesystem in providing > perfection, > and that must be balanced with the performance cost to NFS of > providing > imperfect results. > I strongly disagree. If the 2 change attribute values are different, then it is OK for the file data to be the same, but if the file data has changed, then the change attributes MUST differ. Conversely, if the 2 change attributes are the same then it MUST be the case that the file data did not change. So it really needs to be an 'if and only if' case. > For NFSv4, this is of limited interest for files. > If the client has a delegation, then it is certain that no other > client > or server-side application will change the file, so it doesn't need > to > pay much attention to change ids. > If the client doesn't have a delegation, then if there is any change > to > the changeid, the client cannot be certain that the change wasn't due > to > some other client, so it must purge its cache on close or lock. So > fine > details of the changeid aren't interesting (as long as we have the > "only > if"). > > For directories, NFSv4 does want precise changeids, but directory ops > needs to be sync for NFS anyway, so the extra burden on the fs is > small. > > > > That includes the case where data was written after the read, and a > > crash occurred after it was committed to stable storage. If you > > only > > update the version after the written data is visible, then there is > > a > > possibility that the crash could occur before any change attribute > > update is committed to disk. > > I think we all agree that handling a crash is hard. I think that > should be a separate consideration to how i_version is handled during > normal running. > > > > > IOW: the minimal condition needs to be that for all cases below, > > the > > application reads 'state B' as having occurred if any data was > > committed to disk before the crash. > > > > Application Filesystem > > =========== ========== > > read change attr <- 'state A' > > read data <- 'state A' > > write data -> 'state B' > > <crash>+<reboot> > > read change attr <- 'state B' > > The important thing here is to not see 'state A'. Seeing 'state C' > should be acceptable. Worst case we could merge in wall-clock time > of > system boot, but the filesystem should be able to be more helpful > than > that. > Agreed.
On Fri, 09 Sep 2022, NeilBrown wrote: > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > IOW: the minimal condition needs to be that for all cases below, the > > application reads 'state B' as having occurred if any data was > > committed to disk before the crash. > > > > Application Filesystem > > =========== ========= > > read change attr <- 'state A' > > read data <- 'state A' > > write data -> 'state B' > > <crash>+<reboot> > > read change attr <- 'state B' > > The important thing here is to not see 'state A'. Seeing 'state C' > should be acceptable. Worst case we could merge in wall-clock time of > system boot, but the filesystem should be able to be more helpful than > that. > Actually, without the crash+reboot it would still be acceptable to see "state A" at the end there - but preferably not for long. From the NFS perspective, the changeid needs to update by the time of a close or unlock (so it is visible to open or lock), but before that it is just best-effort. NeilBrown
On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > On Fri, 09 Sep 2022, NeilBrown wrote: > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > IOW: the minimal condition needs to be that for all cases below, > > > the > > > application reads 'state B' as having occurred if any data was > > > committed to disk before the crash. > > > > > > Application Filesystem > > > =========== ========= > > > read change attr <- 'state A' > > > read data <- 'state A' > > > write data -> 'state B' > > > <crash>+<reboot> > > > read change attr <- 'state B' > > > > The important thing here is to not see 'state A'. Seeing 'state C' > > should be acceptable. Worst case we could merge in wall-clock time > > of > > system boot, but the filesystem should be able to be more helpful > > than > > that. > > > > Actually, without the crash+reboot it would still be acceptable to > see > "state A" at the end there - but preferably not for long. > From the NFS perspective, the changeid needs to update by the time of > a > close or unlock (so it is visible to open or lock), but before that > it > is just best-effort. Nope. That will inevitably lead to data corruption, since the application might decide to use the data from state A instead of revalidating it.
On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote: > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > > On Fri, 09 Sep 2022, NeilBrown wrote: > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases > > > > below, > > > > the > > > > application reads 'state B' as having occurred if any data was > > > > committed to disk before the crash. > > > > > > > > Application Filesystem > > > > =========== ========= > > > > read change attr <- 'state A' > > > > read data <- 'state A' > > > > write data -> 'state B' > > > > <crash>+<reboot> > > > > read change attr <- 'state B' > > > > > > The important thing here is to not see 'state A'. Seeing 'state > > > C' > > > should be acceptable. Worst case we could merge in wall-clock > > > time > > > of > > > system boot, but the filesystem should be able to be more helpful > > > than > > > that. > > > > > > > Actually, without the crash+reboot it would still be acceptable to > > see > > "state A" at the end there - but preferably not for long. > > From the NFS perspective, the changeid needs to update by the time > > of > > a > > close or unlock (so it is visible to open or lock), but before that > > it > > is just best-effort. > > Nope. That will inevitably lead to data corruption, since the > application might decide to use the data from state A instead of > revalidating it. > The point is, NFS is not the only potential use case for change attributes. We wouldn't be bothering to discuss statx() if it was. I could be using O_DIRECT, and all the tricks in order to ensure that my stock broker application (to choose one example) has access to the absolute very latest prices when I'm trying to execute a trade. When the filesystem then says 'the prices haven't changed since your last read because the change attribute on the database file is the same' in response to a statx() request with the AT_STATX_FORCE_SYNC flag set, then why shouldn't my application be able to assume it can serve those prices right out of memory instead of having to go to disk?
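For what it is worth, the check described here would look roughly like the sketch below. It assumes the draft STATX_INO_VERSION mask and stx_ino_version field from this series were merged, so it will not build against current headers; those names come from the RFC, not from a released kernel.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * Returns true if a cached copy tagged with 'cached_version' may still be
 * served from memory.  AT_STATX_FORCE_SYNC forces the filesystem (or NFS
 * client) to fetch fresh attributes rather than answer from its own cache.
 */
static bool cache_still_valid(const char *path, uint64_t cached_version)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, AT_STATX_FORCE_SYNC,
		  STATX_INO_VERSION, &stx) != 0)
		return false;	/* on error, revalidate the slow way */

	return stx.stx_ino_version == cached_version;
}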
On Fri, 09 Sep 2022, Trond Myklebust wrote: > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote: > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > > > On Fri, 09 Sep 2022, NeilBrown wrote: > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases > > > > > below, > > > > > the > > > > > application reads 'state B' as having occurred if any data was > > > > > committed to disk before the crash. > > > > > > > > > > Application Filesystem > > > > > =========== ========= > > > > > read change attr <- 'state A' > > > > > read data <- 'state A' > > > > > write data -> 'state B' > > > > > <crash>+<reboot> > > > > > read change attr <- 'state B' > > > > > > > > The important thing here is to not see 'state A'. Seeing 'state > > > > C' > > > > should be acceptable. Worst case we could merge in wall-clock > > > > time > > > > of > > > > system boot, but the filesystem should be able to be more helpful > > > > than > > > > that. > > > > > > > > > > Actually, without the crash+reboot it would still be acceptable to > > > see > > > "state A" at the end there - but preferably not for long. > > > From the NFS perspective, the changeid needs to update by the time > > > of > > > a > > > close or unlock (so it is visible to open or lock), but before that > > > it > > > is just best-effort. > > > > Nope. That will inevitably lead to data corruption, since the > > application might decide to use the data from state A instead of > > revalidating it. > > > > The point is, NFS is not the only potential use case for change > attributes. We wouldn't be bothering to discuss statx() if it was. My understanding is that it was primarily a desire to add fstests to exercise the i_version which motivated the statx extension. Obviously we should prepare for other uses though. > > I could be using O_DIRECT, and all the tricks in order to ensure that > my stock broker application (to choose one example) has access to the > absolute very latest prices when I'm trying to execute a trade. > When the filesystem then says 'the prices haven't changed since your > last read because the change attribute on the database file is the > same' in response to a statx() request with the AT_STATX_FORCE_SYNC > flag set, then why shouldn't my application be able to assume it can > serve those prices right out of memory instead of having to go to disk? I would think that such an application would be using inotify rather than having to poll. But certainly we should have a clear statement of quality-of-service parameters in the documentation. If we agree that perfect atomicity is what we want to promise, and that the cost to the filesystem and the statx call is acceptable, then so be it. My point wasn't to say that atomicity is bad. It was that: - if the i_version change is visible before the change itself is visible, then that is a correctness problem. - if the i_version change is only visible some time after the change itself is visible, then that is a quality-of-service issue. I cannot see any room for debating the first. I do see some room to debate the second. Cached writes, directory ops, and attribute changes are, I think, easy enough to provide truly atomic i_version updates with the change being visible. Changes to a shared memory-mapped files is probably the hardest to provide timely i_version updates for. We might want to document an explicit exception for those. 
Alternatively each request for i_version would need to find all pages that are writable, remap them read-only to catch future writes, then update i_version if any were writable (i.e. ->mkwrite had been called). That is the only way I can think of to provide atomicity. O_DIRECT writes are a little easier than mmapped files. I suspect we should update the i_version once the device reports that the write is complete, but a parallel reader could have seen some of the write before that moment. True atomicity could only be provided by taking some exclusive lock that blocked all O_DIRECT writes. Jeff seems to be suggesting this, but I doubt the stock broker application would be willing to make the call in that case. I don't think I would either. NeilBrown > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com > > >
On Fri, 2022-09-09 at 08:29 +1000, NeilBrown wrote: > On Thu, 08 Sep 2022, Jeff Layton wrote: > > On Thu, 2022-09-08 at 10:40 +1000, NeilBrown wrote: > > > On Thu, 08 Sep 2022, Jeff Layton wrote: > > > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > > > > > respect to the > > > > > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > > > > > i_version it usually > > > > > > > > > > +incremented before the data is copied into the pagecache. > > > > > > > > > > Therefore it is > > > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > > > shows the old data. > > > > > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > > > > > older > > > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > > > hasn't > > > > > > > > changed, then you know that things are stable. > > > > > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > > > > > reader writer > > > > > > > ------ ------ > > > > > > > i_version++ > > > > > > > statx > > > > > > > read > > > > > > > statx > > > > > > > update page cache > > > > > > > > > > > > > > right? > > > > > > > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > > > > > that case, maybe this is useless then other than for testing purposes > > > > > > and userland NFS servers. > > > > > > > > > > > > Would it be better to not consume a statx field with this if so? What > > > > > > could we use as an alternate interface? ioctl? Some sort of global > > > > > > virtual xattr? It does need to be something per-inode. > > > > > > > > > > I don't see how a non-atomic change attribute is remotely useful even > > > > > for NFS. > > > > > > > > > > The main problem is not so much the above (although NFS clients are > > > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > > > > > If the server can't guarantee that file/directory/... creation and > > > > > unlink are atomically recorded with change attribute updates, then the > > > > > client has to always assume that the server is lying, and that it has > > > > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > > > > > requests after each and every directory modification in order to check > > > > > that some other client didn't also sneak in a change of their own. > > > > > > > > > > > > > We generally hold the parent dir's inode->i_rwsem exclusively over most > > > > important directory changes, and the times/i_version are also updated > > > > while holding it. What we don't do is serialize reads of this value vs. > > > > the i_rwsem, so you could see new directory contents alongside an old > > > > i_version. Maybe we should be taking it for read when we query it on a > > > > directory? > > > > > > We do hold i_rwsem today. I'm working on changing that. Preserving > > > atomic directory changeinfo will be a challenge. 
The only mechanism I > > > can think if is to pass a "u64*" to all the directory modification ops, > > > and they fill in the version number at the point where it is incremented > > > (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that > > > "before" was one less than "after". If you don't want to internally > > > require single increments, then you would need to pass a 'u64 [2]' to > > > get two iversions back. > > > > > > > That's a major redesign of what the i_version counter is today. It may > > very well end up being needed, but that's going to touch a lot of stuff > > in the VFS. Are you planning to do that as a part of your locking > > changes? > > > > "A major design"? How? The "one less than" might be, but allowing a > directory morphing op to fill in a "u64 [2]" is just a new interface to > existing data. One that allows fine grained atomicity. > > This would actually be really good for NFS. nfs_mkdir (for example) > could easily have access to the atomic pre/post changedid provided by > the server, and so could easily provide them to nfsd. > > I'm not planning to do this as part of my locking changes. In the first > instance only NFS changes behaviour, and it doesn't provide atomic > changeids, so there is no loss of functionality. > > When some other filesystem wants to opt-in to shared-locking on > directories - that would be the time to push through a better interface. > I think nfsd does provide atomic changeids for directory operations currently. AFAICT, any operation where we're changing directory contents is done while holding the i_rwsem exclusively, and we hold that lock over the pre and post i_version fetch for the change_info4. If you change nfsd to allow parallel directory morphing operations without addressing this, then I think that would be a regression. > > > > > > > > > Achieving atomicity with file writes though is another matter entirely. > > > > I'm not sure that's even doable or how to approach it if so. > > > > Suggestions? > > > > > > Call inode_maybe_inc_version(page->host) in __folio_mark_dirty() ?? > > > > > > > Writes can cover multiple folios so we'd be doing several increments per > > write. Maybe that's ok? Should we also be updating the ctime at that > > point as well? > > You would only do several increments if something was reading the value > concurrently, and then you really should to several increments for > correctness. > Agreed. > > > > Fetching the i_version under the i_rwsem is probably sufficient to fix > > this though. Most of the write_iter ops already bump the i_version while > > holding that lock, so this wouldn't add any extra locking to the write > > codepaths. > > Adding new locking doesn't seem like a good idea. It's bound to have > performance implications. It may well end up serialising the directory > op that I'm currently trying to make parallelisable. > The new locking would only be in the NFSv4 GETATTR codepath: https://lore.kernel.org/linux-nfs/20220908172448.208585-9-jlayton@kernel.org/T/#u Maybe we'd still better off taking a hit in the write codepath instead of doing this, but with this, most of the penalty would be paid by nfsd which I would think would be preferred here. The problem of mmap writes is another matter though. Not sure what we can do about that without making i_version bumps a lot more expensive.
On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > Ted, how would we access this? Maybe we could just add a new (generic) > super_block field for this that ext4 (and other filesystems) could > populate at mount time? Yeah, I was thinking about just adding it to struct super, with some value (perhaps 0 or ~0) meaning that the file system didn't support it. If people were concerned about struct super bloat, we could also add some new function to struct super_ops that would return one or more values that are used rarely by most of the kernel code, and so doesn't need to be in the struct super data structure. I don't have strong feelings one way or another. On another note, my personal opinion is that at least as far as ext4 is concerned, i_version on disk's only use is for NFS's convenience, and so I have absolutely no problem with changing how and when i_version gets updated, modulo concerns about impacting performance. That's one of the reasons why being able to update i_version only lazily, so that if we had, say, some workload that was doing O_DIRECT writes followed by fdatasync(), there wouldn't be any obligation to flush the inode out to disk just because we had bumped i_version, appeals to me. But aside from that, I don't much care when i_version gets updated on disk, especially what the semantics are after a crash; if we need to change things so that NFS can be more performant, I'm happy to accommodate. One of the reasons why we implemented the ext4 fast commit feature was to improve performance for NFS workloads. I know some XFS developers have some concerns here, but I just wanted to make it explicit that (a) I'm not aware of any users who are depending on the i_version on-disk semantics, and (b) if they are depending on something which, as far as I'm concerned, is an internal implementation detail, we've made no promises to them, and they can get to keep both pieces. :-) This is especially since up until now, there has been no supported, portable userspace interface to make i_version available to userspace. Cheers, - Ted
On Fri, 2022-09-09 at 08:11 -0400, Theodore Ts'o wrote: > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > super_block field for this that ext4 (and other filesystems) could > > populate at mount time? > > Yeah, I was thinking about just adding it to struct super, with some > value (perhaps 0 or ~0) meaning that the file system didn't support > it. If people were concerned about struct super bloat, we could also > add some new function to struct super_ops that would return one or > more values that are used rarely by most of the kernel code, and so > doesn't need to be in the struct super data structure. I don't have > strong feelings one way or another. > Either would be fine, I think. > On another note, my personal opinion is that at least as far as ext4 > is concerned, i_version on disk's only use is for NFS's convenience, Technically, IMA uses it too, but it needs the same behavior as NFSv4. > and so I have absolutely no problem with changing how and when > i_version gets updated modulo concerns about impacting performance. > That's one of the reasons why being able to update i_version only > lazily, so that if we had, say, some workload that was doing O_DIRECT > writes followed by fdatasync(), there wouldn't be any obligation to > flush the inode out to disk just because we had bumped i_version > appeals to me. > i_version only changes now if someone has queried it since it was last changed. That makes a huge difference in performance. We can try to optimize it further, but it probably wouldn't move the needle much under real workloads. > But aside from that, I don't consider when i_version gets updated on > disk, especially what the semantics are after a crash, and if we need > to change things so that NFS can be more performant, I'm happy to > accomodate. One of the reasons why we implemented the ext4 fast > commit feature was to improve performance for NFS workloads. > > I know some XFS developers have some concerns here, but I just wanted > to make it be explicit that (a) I'm not aware of any users who are > depending on the i_version on-disk semantics, and (b) if they are > depending on something which as far as I'm concerned in an internal > implementation detail, we've made no promises to them, and they can > get to keep both pieces. :-) This is especially since up until now, > there is no supported, portable userspace interface to make i_version > available to userspace. > Great! That's what I was hoping for with ext4. Would you be willing to pick up these two patches for v6.1? https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u https://lore.kernel.org/linux-ext4/20220908172448.208585-4-jlayton@kernel.org/T/#u They should be able to go in independently of the rest of the series and I don't forsee any big changes to them. Thanks,
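The "only changes if someone has queried it since the last change" behaviour mentioned above can be modelled as below. The real code lives in include/linux/iversion.h and uses atomics; the details there differ, this only illustrates the idea of stealing the low bit of the counter as a "someone has looked at this" flag.

#include <stdint.h>
#include <stdbool.h>

#define I_VERSION_QUERIED	UINT64_C(1)	/* low bit: value has been read   */
#define I_VERSION_INCREMENT	UINT64_C(2)	/* the counter lives in bits 63:1 */

/* Reader side: fetch the counter and mark it as observed. */
static uint64_t query_iversion(uint64_t *raw)
{
	*raw |= I_VERSION_QUERIED;
	return *raw >> 1;
}

/* Writer side: skip the bump (and the inode dirtying it would trigger)
 * unless a reader has looked at the value since the last change. */
static bool maybe_inc_iversion(uint64_t *raw, bool force)
{
	if (!force && !(*raw & I_VERSION_QUERIED))
		return false;
	*raw = (*raw + I_VERSION_INCREMENT) & ~I_VERSION_QUERIED;
	return true;
}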
On Fri, Sep 09, 2022 at 08:47:17AM -0400, Jeff Layton wrote: > > i_version only changes now if someone has queried it since it was last > changed. That makes a huge difference in performance. We can try to > optimize it further, but it probably wouldn't move the needle much under > real workloads. Good point. And to be clear, from NFS's perspective, you only need to have i_version bumped if there is a user-visible change to the file. --- with an explicit exception here of the FIEMAP system call, since in the case of a delayed allocation, FIEMAP might change from reporting:

 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:          0..         0:      0:             last,unknown_loc,delalloc,eof

to this:

 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:  190087172.. 190087172:      1:             last,eof

after a sync(2) or fsync(2) call, or after time passes. > Great! That's what I was hoping for with ext4. Would you be willing to > pick up these two patches for v6.1? > > https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u > https://lore.kernel.org/linux-ext4/20220908172448.208585-4-jlayton@kernel.org/T/#u I think you mean: https://lore.kernel.org/linux-ext4/20220908172448.208585-2-jlayton@kernel.org/T/#u https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u Right? BTW, sorry for not responding to these patches earlier; between preparing for the various Linux conferences in Dublin next week, and being in Zurich and meeting with colleagues at $WORK all of this week, I'm a bit behind on my patch reviews. Cheers, - Ted
On Fri, 2022-09-09 at 09:48 -0400, Theodore Ts'o wrote: > On Fri, Sep 09, 2022 at 08:47:17AM -0400, Jeff Layton wrote: > > > > i_version only changes now if someone has queried it since it was last > > changed. That makes a huge difference in performance. We can try to > > optimize it further, but it probably wouldn't move the needle much under > > real workloads. > > Good point. And to be clear, from NFS's perspective, you only need to > have i_version bumped if there is a user-visible change to the > file. --- with an explicit exception here of the FIEMAP system call, > since in the case of a delayed allocation, FIEMAP might change from > reporting: > > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 0.. 0: 0: last,unknown_loc,delalloc,eof > > to this: > > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 190087172.. 190087172: 1: last,eof > > after a sync(2) or fsync(2) call, or after time passes. > In general, we want to bump i_version if the ctime changes. I'm guessing that we don't change ctime on a delalloc? If it's not visible to NFS, then NFS won't care about it. We can't project FIEMAP info across the wire at this time, so we'd probably like to avoid seeing an i_version bump in due to delalloc. > > Great! That's what I was hoping for with ext4. Would you be willing to > > pick up these two patches for v6.1? > > > > https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u > > https://lore.kernel.org/linux-ext4/20220908172448.208585-4-jlayton@kernel.org/T/#u > > I think you mean: > > https://lore.kernel.org/linux-ext4/20220908172448.208585-2-jlayton@kernel.org/T/#u > https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u > > Right? > > BTW, sorry for not responding to these patches earlier; between > preparing for the various Linux conferences in Dublin next week, and > being in Zurich and meeting with colleagues at $WORK all of this week, > I'm a bit behind on my patch reviews. > No worries. As long as they're on your radar, that's fine. Thanks!
On Fri, Sep 09, 2022 at 10:43:30AM -0400, Jeff Layton wrote: > In general, we want to bump i_version if the ctime changes. I'm guessing > that we don't change ctime on a delalloc? If it's not visible to NFS, > then NFS won't care about it. We can't project FIEMAP info across the > wire at this time, so we'd probably like to avoid seeing an i_version > bump in due to delalloc. Right, currently nothing user-visible changes when delayed allocation is resolved; ctime isn't bumped, and i_version shouldn't be bumped either. If we crash before delayed allocation is resolved, there might be cases (mounting with data=writeback is the one which I'm most worried about, but I haven't experimented to be sure) where the inode might become a zero-length file after the reboot without i_version or ctime changing, but given that NFS forces a fsync(2) before it acknowledges a client request, that shouldn't be an issue for NFS. This is where, as far as I'm concerned, for ext4, i_version has only one customer to keep happy, and it's NFS. :-) Now, if we expose i_version via statx(2), we might need to be a tad bit more careful about what semantics we guarantee to userspace, especially with respect to what might be returned before and after a crash recovery. If we can leave things such that there is maximal freedom for file system implementations, that would be my preference. - Ted
On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote: > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > Yeah, ok. That does make some sense. So we would mix this into the > > > i_version instead of the ctime when it was available. Preferably, we'd > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > super_block field for this that ext4 (and other filesystems) could > > > populate at mount time? > > > > Couldn't the filesystem just return an ino_version that already includes > > it? > > > > Yes. That's simple if we want to just fold it in during getattr. If we > want to fold that into the values stored on disk, then I'm a little less > clear on how that will work. > > Maybe I need a concrete example of how that will work: > > Suppose we have an i_version value X with the previous crash counter > already factored in that makes it to disk. We hand out a newer version > X+1 to a client, but that value never makes it to disk. > > The machine crashes and comes back up, and we get a query for i_version > and it comes back as X. Fine, it's an old version. Now there is a write. > What do we do to ensure that the new value doesn't collide with X+1? I was assuming we could partition i_version's 64 bits somehow: e.g., top 16 bits store the crash counter. You increment the i_version by: 1) replacing the top bits by the new crash counter, if it has changed, and 2) incrementing. Do the numbers work out? 2^16 mounts after unclean shutdowns sounds like a lot for one filesystem, as does 2^48 changes to a single file, but people do weird things. Maybe there's a better partitioning, or some more flexible way of maintaining an i_version that still allows you to identify whether a given i_version preceded a crash. --b.
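As a concrete illustration of that split (the 16/48 partition and the helper name are made up for the example; nothing like this exists in the kernel today):

#include <stdint.h>

#define CRASH_BITS	16
#define COUNTER_BITS	(64 - CRASH_BITS)
#define COUNTER_MASK	((UINT64_C(1) << COUNTER_BITS) - 1)

/*
 * Bump the change attribute: stamp the current crash counter into the top
 * bits, then increment.  After an unclean shutdown the crash counter has
 * grown, so any value generated afterwards differs from (and exceeds)
 * anything handed out but not persisted before the crash; the scheme only
 * works as long as the low part never overflows into the crash-counter bits.
 */
static uint64_t inc_i_version(uint64_t cur, uint64_t crash_counter)
{
	uint64_t v = (crash_counter << COUNTER_BITS) | (cur & COUNTER_MASK);

	return v + 1;
}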
On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote: > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote: > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > > Yeah, ok. That does make some sense. So we would mix this into the > > > > i_version instead of the ctime when it was available. Preferably, we'd > > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > > super_block field for this that ext4 (and other filesystems) could > > > > populate at mount time? > > > > > > Couldn't the filesystem just return an ino_version that already includes > > > it? > > > > > > > Yes. That's simple if we want to just fold it in during getattr. If we > > want to fold that into the values stored on disk, then I'm a little less > > clear on how that will work. > > > > Maybe I need a concrete example of how that will work: > > > > Suppose we have an i_version value X with the previous crash counter > > already factored in that makes it to disk. We hand out a newer version > > X+1 to a client, but that value never makes it to disk. > > > > The machine crashes and comes back up, and we get a query for i_version > > and it comes back as X. Fine, it's an old version. Now there is a write. > > What do we do to ensure that the new value doesn't collide with X+1? > > I was assuming we could partition i_version's 64 bits somehow: e.g., top > 16 bits store the crash counter. You increment the i_version by: 1) > replacing the top bits by the new crash counter, if it has changed, and > 2) incrementing. > > Do the numbers work out? 2^16 mounts after unclean shutdowns sounds > like a lot for one filesystem, as does 2^48 changes to a single file, > but people do weird things. Maybe there's a better partitioning, or > some more flexible way of maintaining an i_version that still allows you > to identify whether a given i_version preceded a crash. > We consume one bit to keep track of the "seen" flag, so it would be a 16+47 split. I assume that we'd also reset the version counter to 0 when the crash counter changes? Maybe that doesn't matter as long as we don't overflow into the crash counter. I'm not sure we can get away with 16 bits for the crash counter, as it'll leave us subject to the version counter wrapping after a long uptimes. If you increment a counter every nanosecond, how long until that counter wraps? With 63 bits, that's 292 years (and change). With 16+47 bits, that's less than two days. An 8+55 split would give us ~416 days which seems a bit more reasonable? For NFS, we can probably live with even less bits in the crash counter. If the crash counter changes, then that means the NFS server itself has (likely) also crashed. The client will have to reestablish sockets, reclaim, etc. It should get new attributes for the inodes it cares about at that time.
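For reference, the wrap times quoted above check out, assuming one increment per nanosecond (a quick back-of-the-envelope program; compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double ns_per_day  = 86400e9;
	const double ns_per_year = 365.25 * 86400e9;

	printf("47-bit counter: %.1f days\n",  pow(2, 47) / ns_per_day);   /* ~1.6 days  */
	printf("55-bit counter: %.0f days\n",  pow(2, 55) / ns_per_day);   /* ~417 days  */
	printf("63-bit counter: %.0f years\n", pow(2, 63) / ns_per_year);  /* ~292 years */
	return 0;
}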
>>>>> "Jeff" == Jeff Layton <jlayton@kernel.org> writes: > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: >> On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: >> > Yeah, ok. That does make some sense. So we would mix this into the >> > i_version instead of the ctime when it was available. Preferably, we'd >> > mix that in when we store the i_version rather than adding it afterward. >> > >> > Ted, how would we access this? Maybe we could just add a new (generic) >> > super_block field for this that ext4 (and other filesystems) could >> > populate at mount time? >> >> Couldn't the filesystem just return an ino_version that already includes >> it? >> > Yes. That's simple if we want to just fold it in during getattr. If we > want to fold that into the values stored on disk, then I'm a little less > clear on how that will work. I wonder if this series should also include some updates to the various xfstests to hopefully document in code what this statx() call will do in various situations. Or at least document how to test it in some manner? Especially since it's layers on top of layers to make this work. My assumption is that if the underlying filesystem doesn't support the new values, it just returns 0 or c_time? John
On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote: > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote: > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote: > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases > > > > > > > > > > > > below, > > > > > > > > > > > > the > > > > > > > > > > > > application reads 'state B' as having occurred if any data was > > > > > > > > > > > > committed to disk before the crash. > > > > > > > > > > > > > > > > > > > > > > > > Application Filesystem > > > > > > > > > > > > =========== ========= > > > > > > > > > > > > read change attr <- 'state A' > > > > > > > > > > > > read data <- 'state A' > > > > > > > > > > > > write data -> 'state B' > > > > > > > > > > > > <crash>+<reboot> > > > > > > > > > > > > read change attr <- 'state B' > > > > > > > > > > > > > > > > > > > > The important thing here is to not see 'state A'. Seeing 'state > > > > > > > > > > C' > > > > > > > > > > should be acceptable. Worst case we could merge in wall-clock > > > > > > > > > > time > > > > > > > > > > of > > > > > > > > > > system boot, but the filesystem should be able to be more helpful > > > > > > > > > > than > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, without the crash+reboot it would still be acceptable to > > > > > > > > see > > > > > > > > "state A" at the end there - but preferably not for long. > > > > > > > > From the NFS perspective, the changeid needs to update by the time > > > > > > > > of > > > > > > > > a > > > > > > > > close or unlock (so it is visible to open or lock), but before that > > > > > > > > it > > > > > > > > is just best-effort. > > > > > > > > > > > > Nope. That will inevitably lead to data corruption, since the > > > > > > application might decide to use the data from state A instead of > > > > > > revalidating it. > > > > > > > > > > > > > > The point is, NFS is not the only potential use case for change > > > > attributes. We wouldn't be bothering to discuss statx() if it was. > > > > My understanding is that it was primarily a desire to add fstests to > > exercise the i_version which motivated the statx extension. > > Obviously we should prepare for other uses though. > > Mainly. Also, userland nfs servers might also like this for obvious reasons. For now though, in the v5 set, I've backed off on trying to expose this to userland in favor of trying to just clean up the internal implementation. I'd still like to expose this via statx if possible, but I don't want to get too bogged down in interface design just now as we have Real Bugs to fix. That patchset should make it simple to expose it later though. > > > > > > > > I could be using O_DIRECT, and all the tricks in order to ensure > > > > that > > > > my stock broker application (to choose one example) has access > > > > to the > > > > absolute very latest prices when I'm trying to execute a trade. 
> > > > When the filesystem then says 'the prices haven't changed since > > > > your > > > > last read because the change attribute on the database file is > > > > the > > > > same' in response to a statx() request with the > > > > AT_STATX_FORCE_SYNC > > > > flag set, then why shouldn't my application be able to assume it > > > > can > > > > serve those prices right out of memory instead of having to go > > > > to disk? > > > > I would think that such an application would be using inotify rather > > than having to poll. But certainly we should have a clear statement > > of > > quality-of-service parameters in the documentation. > > If we agree that perfect atomicity is what we want to promise, and > > that > > the cost to the filesystem and the statx call is acceptable, then so > > be it. > > > > My point wasn't to say that atomicity is bad. It was that: > > - if the i_version change is visible before the change itself is > > visible, then that is a correctness problem. > > - if the i_version change is only visible some time after the > > change > > itself is visible, then that is a quality-of-service issue. > > I cannot see any room for debating the first. I do see some room to > > debate the second. > > > > Cached writes, directory ops, and attribute changes are, I think, > > easy > > enough to provide truly atomic i_version updates with the change > > being > > visible. > > > > Changes to a shared memory-mapped files is probably the hardest to > > provide timely i_version updates for. We might want to document an > > explicit exception for those. Alternately each request for > > i_version > > would need to find all pages that are writable, remap them read-only > > to > > catch future writes, then update i_version if any were writable > > (i.e. > > ->mkwrite had been called). That is the only way I can think of to > > provide atomicity. > > I don't think we really want to make i_version bumps that expensive. Documenting that you can't expect perfect consistency vs. mmap with NFS seems like the best thing to do. We do our best, but that sort of synchronization requires real locking. > > O_DIRECT writes are a little easier than mmapped files. I suspect we > > should update the i_version once the device reports that the write is > > complete, but a parallel reader could have seem some of the write before > > that moment. True atomicity could only be provided by taking some > > exclusive lock that blocked all O_DIRECT writes. Jeff seems to be > > suggesting this, but I doubt the stock broker application would be > > willing to make the call in that case. I don't think I would either. Well, only blocked for long enough to run the getattr. Granted, with a slow underlying filesystem that can take a while. To summarize, there are two main uses for the change attr in NFSv4: 1/ to provide change_info4 for directory morphing operations (CREATE, LINK, OPEN, REMOVE, and RENAME). It turns out that this is already atomic in the current nfsd code (AFAICT) by virtue of the fact that we hold the i_rwsem exclusively over these operations. The change attr is also queried pre and post while the lock is held, so that should ensure that we get true atomicity for this. 2/ as an adjunct for the ctime when fetching attributes to validate caches. We don't expect perfect consistency between read (and readlike) operations and GETATTR, even when they're in the same compound. 
IOW, a READ+GETATTR compound can legally give you a short (or zero- length) read, and then the getattr indicates a size that is larger than where the READ data stops, due to a write or truncate racing in after the read. Ideally, the attributes in the GETATTR reply should be consistent between themselves though. IOW, all of the attrs should accurately represent the state of the file at a single point in time. change+size+times+etc. should all be consistent with one another. I think we get all of this by taking the inode_lock around the vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant solution, but it should give us the atomicity we need, and it doesn't require adding extra operations or locking to the write codepaths. We could also consider less invasive ways to achieve this (maybe some sort of seqretry loop around the vfs_getattr call?), but I'd rather not do extra work in the write codepaths if we can get away with it.
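A rough sketch of that GETATTR-side locking (not the actual nfsd patch; whether the lock is taken shared or exclusive, and exactly where in nfsd4_encode_fattr it goes, are details the real series may handle differently):

#include <linux/fs.h>
#include <linux/fcntl.h>
#include <linux/stat.h>
#include <linux/path.h>

/*
 * Sample the attributes while holding the inode lock, so that the size,
 * times and change attribute in the returned kstat all describe the same
 * point in time with respect to writes and directory operations, which
 * take the lock exclusively.
 */
static int getattr_atomic(const struct path *path, struct kstat *stat,
			  u32 request_mask)
{
	struct inode *inode = d_inode(path->dentry);
	int err;

	inode_lock_shared(inode);
	err = vfs_getattr(path, stat, request_mask, AT_STATX_SYNC_AS_STAT);
	inode_unlock_shared(inode);
	return err;
}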
On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: > On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote: > > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote: > > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > > > Yeah, ok. That does make some sense. So we would mix this into the > > > > > i_version instead of the ctime when it was available. Preferably, we'd > > > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > > > super_block field for this that ext4 (and other filesystems) could > > > > > populate at mount time? > > > > > > > > Couldn't the filesystem just return an ino_version that already includes > > > > it? > > > > > > > > > > Yes. That's simple if we want to just fold it in during getattr. If we > > > want to fold that into the values stored on disk, then I'm a little less > > > clear on how that will work. > > > > > > Maybe I need a concrete example of how that will work: > > > > > > Suppose we have an i_version value X with the previous crash counter > > > already factored in that makes it to disk. We hand out a newer version > > > X+1 to a client, but that value never makes it to disk. > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > I was assuming we could partition i_version's 64 bits somehow: e.g., top > > 16 bits store the crash counter. You increment the i_version by: 1) > > replacing the top bits by the new crash counter, if it has changed, and > > 2) incrementing. > > > > Do the numbers work out? 2^16 mounts after unclean shutdowns sounds > > like a lot for one filesystem, as does 2^48 changes to a single file, > > but people do weird things. Maybe there's a better partitioning, or > > some more flexible way of maintaining an i_version that still allows you > > to identify whether a given i_version preceded a crash. > > > > We consume one bit to keep track of the "seen" flag, so it would be a > 16+47 split. I assume that we'd also reset the version counter to 0 when > the crash counter changes? Maybe that doesn't matter as long as we don't > overflow into the crash counter. > > I'm not sure we can get away with 16 bits for the crash counter, as > it'll leave us subject to the version counter wrapping after a long > uptimes. > > If you increment a counter every nanosecond, how long until that counter > wraps? With 63 bits, that's 292 years (and change). With 16+47 bits, > that's less than two days. An 8+55 split would give us ~416 days which > seems a bit more reasonable? Though now it's starting to seem a little limiting to allow only 2^8 mounts after unclean shutdowns. Another way to think of it might be: multiply that 8-bit crash counter by 2^48, and think of it as a 64-bit value that we believe (based on practical limits on how many times you can modify a single file) is gauranteed to be larger than any i_version that we gave out before the most recent crash. Our goal is to ensure that after a crash, any *new* i_versions that we give out or write to disk are larger than any that have previously been given out. We can do that by ensuring that they're equal to at least that old maximum. 
So think of the 64-bit value we're storing in the superblock as a ceiling on i_version values across all the filesystem's inodes. Call it s_version_max or something. We also need to know what the maximum was before the most recent crash. Call that s_version_max_old. Then we could get correct behavior if we generated i_versions with something like:

	i_version++;
	if (i_version < s_version_max_old)
		i_version = s_version_max_old;
	if (i_version > s_version_max)
		s_version_max = i_version + 1;

But that last step makes this ludicrously expensive, because for this to be safe across crashes we need to update that value on disk as well, and we need to do that frequently. Fortunately, s_version_max doesn't have to be a tight bound at all. We can easily just initialize it to, say, 2^40, and only bump it by 2^40 at a time. And recognize when we're running up against it way ahead of time, so we only need to say "here's an updated value, could you please make sure it gets to disk sometime in the next twenty minutes"? (Numbers made up.) Sorry, that was way too many words. But I think something like that could work, and make it very difficult to hit any hard limits, and actually not be too complicated?? Unless I missed something. --b.
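For reference, here is a minimal standalone sketch of the other idea floated above, the "crash counter in the top bits" split (16 bits of crash counter, 47 bits of per-inode counter, with the low bit assumed to stay reserved for the "seen"/queried flag). All names are hypothetical and this is not kernel code; it only illustrates why a bump after a crash-counter change always lands above every pre-crash value.

	#include <stdint.h>

	#define CRASH_BITS    16
	#define COUNTER_BITS  (63 - CRASH_BITS)   /* one bit left for the "seen" flag */
	#define COUNTER_MASK  ((UINT64_C(1) << COUNTER_BITS) - 1)

	/* Compose a change attribute from the mount-time crash counter and the
	 * per-inode counter (the "seen" flag is assumed to be handled separately). */
	static inline uint64_t chattr_pack(uint64_t crash, uint64_t count)
	{
		return (crash << COUNTER_BITS) | (count & COUNTER_MASK);
	}

	/* Bump: stamp in the current crash counter, then increment.  If the crash
	 * counter changed since this value was stored, the result jumps past
	 * anything handed out before the crash. */
	static inline uint64_t chattr_bump(uint64_t cur, uint64_t crash)
	{
		return chattr_pack(crash, (cur & COUNTER_MASK) + 1);
	}

The trade-off discussed above is visible directly: CRASH_BITS bounds the number of unclean shutdowns, and COUNTER_BITS bounds how many increments a single file can see before the counter part wraps.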
On Thu, Sep 08, 2022 at 10:40:43AM +1000, NeilBrown wrote: > We do hold i_rwsem today. I'm working on changing that. Preserving > atomic directory changeinfo will be a challenge. The only mechanism I > can think of is to pass a "u64*" to all the directory modification ops, > and they fill in the version number at the point where it is incremented > (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that > "before" was one less than "after". If you don't want to internally > require single increments, then you would need to pass a 'u64 [2]' to > get two iversions back. Are you serious? What kind of boilerplate would that inflict on the filesystems not, er, opting in for that... scalability improvement experiment?
On Fri, 09 Sep 2022, Jeff Layton wrote: > > The machine crashes and comes back up, and we get a query for i_version > and it comes back as X. Fine, it's an old version. Now there is a write. > What do we do to ensure that the new value doesn't collide with X+1? (I missed this bit in my earlier reply..) How is it "Fine" to see an old version? The file could have changed without the version changing. And I thought one of the goals of the crash-count was to be able to provide a monotonic change id. NeilBrown
On Sat, 10 Sep 2022, Jeff Layton wrote: > On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote: > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote: > > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote: > > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases > > > > > > > > > > > > > below, > > > > > > > > > > > > > the > > > > > > > > > > > > > application reads 'state B' as having occurred if any data was > > > > > > > > > > > > > committed to disk before the crash. > > > > > > > > > > > > > > > > > > > > > > > > > > Application Filesystem > > > > > > > > > > > > > =========== ========= > > > > > > > > > > > > > read change attr <- 'state A' > > > > > > > > > > > > > read data <- 'state A' > > > > > > > > > > > > > write data -> 'state B' > > > > > > > > > > > > > <crash>+<reboot> > > > > > > > > > > > > > read change attr <- 'state B' > > > > > > > > > > > > > > > > > > > > > > The important thing here is to not see 'state A'. Seeing 'state > > > > > > > > > > > C' > > > > > > > > > > > should be acceptable. Worst case we could merge in wall-clock > > > > > > > > > > > time > > > > > > > > > > > of > > > > > > > > > > > system boot, but the filesystem should be able to be more helpful > > > > > > > > > > > than > > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, without the crash+reboot it would still be acceptable to > > > > > > > > > see > > > > > > > > > "state A" at the end there - but preferably not for long. > > > > > > > > > From the NFS perspective, the changeid needs to update by the time > > > > > > > > > of > > > > > > > > > a > > > > > > > > > close or unlock (so it is visible to open or lock), but before that > > > > > > > > > it > > > > > > > > > is just best-effort. > > > > > > > > > > > > > > Nope. That will inevitably lead to data corruption, since the > > > > > > > application might decide to use the data from state A instead of > > > > > > > revalidating it. > > > > > > > > > > > > > > > > > The point is, NFS is not the only potential use case for change > > > > > attributes. We wouldn't be bothering to discuss statx() if it was. > > > > > > My understanding is that it was primarily a desire to add fstests to > > > exercise the i_version which motivated the statx extension. > > > Obviously we should prepare for other uses though. > > > > > Mainly. Also, userland nfs servers might also like this for obvious > reasons. For now though, in the v5 set, I've backed off on trying to > expose this to userland in favor of trying to just clean up the internal > implementation. > > I'd still like to expose this via statx if possible, but I don't want to > get too bogged down in interface design just now as we have Real Bugs to > fix. That patchset should make it simple to expose it later though. > > > > > > > > > > > I could be using O_DIRECT, and all the tricks in order to ensure > > > > > that > > > > > my stock broker application (to choose one example) has access > > > > > to the > > > > > absolute very latest prices when I'm trying to execute a trade. 
> > > > > When the filesystem then says 'the prices haven't changed since > > > > > your > > > > > last read because the change attribute on the database file is > > > > > the > > > > > same' in response to a statx() request with the > > > > > AT_STATX_FORCE_SYNC > > > > > flag set, then why shouldn't my application be able to assume it > > > > > can > > > > > serve those prices right out of memory instead of having to go > > > > > to disk? > > > > > > I would think that such an application would be using inotify rather > > > than having to poll. But certainly we should have a clear statement > > > of > > > quality-of-service parameters in the documentation. > > > If we agree that perfect atomicity is what we want to promise, and > > > that > > > the cost to the filesystem and the statx call is acceptable, then so > > > be it. > > > > > > My point wasn't to say that atomicity is bad. It was that: > > > - if the i_version change is visible before the change itself is > > > visible, then that is a correctness problem. > > > - if the i_version change is only visible some time after the > > > change > > > itself is visible, then that is a quality-of-service issue. > > > I cannot see any room for debating the first. I do see some room to > > > debate the second. > > > > > > Cached writes, directory ops, and attribute changes are, I think, > > > easy > > > enough to provide truly atomic i_version updates with the change > > > being > > > visible. > > > > > > Changes to a shared memory-mapped files is probably the hardest to > > > provide timely i_version updates for. We might want to document an > > > explicit exception for those. Alternately each request for > > > i_version > > > would need to find all pages that are writable, remap them read-only > > > to > > > catch future writes, then update i_version if any were writable > > > (i.e. > > > ->mkwrite had been called). That is the only way I can think of to > > > provide atomicity. > > > > > I don't think we really want to make i_version bumps that expensive. > Documenting that you can't expect perfect consistency vs. mmap with NFS > seems like the best thing to do. We do our best, but that sort of > synchronization requires real locking. > > > > O_DIRECT writes are a little easier than mmapped files. I suspect we > > > should update the i_version once the device reports that the write is > > > complete, but a parallel reader could have seem some of the write before > > > that moment. True atomicity could only be provided by taking some > > > exclusive lock that blocked all O_DIRECT writes. Jeff seems to be > > > suggesting this, but I doubt the stock broker application would be > > > willing to make the call in that case. I don't think I would either. > > Well, only blocked for long enough to run the getattr. Granted, with a > slow underlying filesystem that can take a while. Maybe I misunderstand, but this doesn't seem to make much sense. If you want i_version updates to appear to be atomic w.r.t O_DIRECT writes, then you need to prevent accessing the i_version while any write is on-going. At that time there is no meaningful value for i_version. So you need a lock (At least shared) around the actual write, and you need an exclusive lock around the get_i_version(). So accessing the i_version would have to wait for all pending O_DIRECT writes to complete, and would block any new O_DIRECT writes from starting. This could be expensive. There is not currently any locking around O_DIRECT writes. You cannot synchronise with them. 
The best you can do is update the i_version immediately after all the O_DIRECT writes in a single request complete. > > To summarize, there are two main uses for the change attr in NFSv4: > > 1/ to provide change_info4 for directory morphing operations (CREATE, > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already > atomic in the current nfsd code (AFAICT) by virtue of the fact that we > hold the i_rwsem exclusively over these operations. The change attr is > also queried pre and post while the lock is held, so that should ensure > that we get true atomicity for this. Yes, directory ops are relatively easy. > > 2/ as an adjunct for the ctime when fetching attributes to validate > caches. We don't expect perfect consistency between read (and readlike) > operations and GETATTR, even when they're in the same compound. > > IOW, a READ+GETATTR compound can legally give you a short (or zero- > length) read, and then the getattr indicates a size that is larger than > where the READ data stops, due to a write or truncate racing in after > the read. I agree that atomicity is neither necessary nor practical. Ordering is important though. I don't think a truncate(0) racing with a READ can credibly result in a non-zero size AFTER a zero-length read. A truncate that extends the size could have that effect though. > > Ideally, the attributes in the GETATTR reply should be consistent > between themselves though. IOW, all of the attrs should accurately > represent the state of the file at a single point in time. > change+size+times+etc. should all be consistent with one another. > > I think we get all of this by taking the inode_lock around the > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant > solution, but it should give us the atomicity we need, and it doesn't > require adding extra operations or locking to the write codepaths. Explicit attribute changes (chown/chmod/utimes/truncate etc) are always done under the inode lock. Implicit changes via inode_update_time() are not (though xfs does take the lock, ext4 doesn't, haven't checked others). So taking the inode lock won't ensure those are internally consistent. I think using inode_lock_shared() is acceptable. It doesn't promise perfect atomicity, but it is probably good enough. We'd need a good reason to want perfect atomicity to go further, and I cannot think of one. NeilBrown > > We could also consider less invasive ways to achieve this (maybe some > sort of seqretry loop around the vfs_getattr call?), but I'd rather not > do extra work in the write codepaths if we can get away with it. > -- > Jeff Layton <jlayton@kernel.org> > >
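To make the inode_lock_shared() option concrete, here is a minimal sketch of fetching attributes under the shared lock. This is not the actual nfsd patch being discussed; the helper name is invented and the request mask is just an example.

	#include <linux/fs.h>
	#include <linux/stat.h>

	/*
	 * Fetch a self-consistent set of attributes for the encode path.  The
	 * shared lock keeps out buffered/DAX writes and explicit attribute
	 * changes, but (as noted above) not every O_DIRECT write or mmap store.
	 */
	static int encode_getattr_locked(const struct path *path, struct kstat *stat)
	{
		struct inode *inode = d_inode(path->dentry);
		int err;

		inode_lock_shared(inode);
		err = vfs_getattr(path, stat, STATX_BASIC_STATS,
				  0 /* AT_STATX_SYNC_AS_STAT */);
		inode_unlock_shared(inode);
		return err;
	}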
On Fri, 09 Sep 2022, Jeff Layton wrote: > On Fri, 2022-09-09 at 08:29 +1000, NeilBrown wrote: > > On Thu, 08 Sep 2022, Jeff Layton wrote: > > > On Thu, 2022-09-08 at 10:40 +1000, NeilBrown wrote: > > > > On Thu, 08 Sep 2022, Jeff Layton wrote: > > > > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote: > > > > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote: > > > > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote: > > > > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote: > > > > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote: > > > > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with > > > > > > > > > > > respect to the > > > > > > > > > > > +other changes in the inode. On a write, for instance, the > > > > > > > > > > > i_version it usually > > > > > > > > > > > +incremented before the data is copied into the pagecache. > > > > > > > > > > > Therefore it is > > > > > > > > > > > +possible to see a new i_version value while a read still > > > > > > > > > > > shows the old data. > > > > > > > > > > > > > > > > > > > > Doesn't that make the value useless? > > > > > > > > > > > > > > > > > > > > > > > > > > > > No, I don't think so. It's only really useful for comparing to an > > > > > > > > > older > > > > > > > > > sample anyway. If you do "statx; read; statx" and the value > > > > > > > > > hasn't > > > > > > > > > changed, then you know that things are stable. > > > > > > > > > > > > > > > > I don't see how that helps. It's still possible to get: > > > > > > > > > > > > > > > > reader writer > > > > > > > > ------ ------ > > > > > > > > i_version++ > > > > > > > > statx > > > > > > > > read > > > > > > > > statx > > > > > > > > update page cache > > > > > > > > > > > > > > > > right? > > > > > > > > > > > > > > > > > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In > > > > > > > that case, maybe this is useless then other than for testing purposes > > > > > > > and userland NFS servers. > > > > > > > > > > > > > > Would it be better to not consume a statx field with this if so? What > > > > > > > could we use as an alternate interface? ioctl? Some sort of global > > > > > > > virtual xattr? It does need to be something per-inode. > > > > > > > > > > > > I don't see how a non-atomic change attribute is remotely useful even > > > > > > for NFS. > > > > > > > > > > > > The main problem is not so much the above (although NFS clients are > > > > > > vulnerable to that too) but the behaviour w.r.t. directory changes. > > > > > > > > > > > > If the server can't guarantee that file/directory/... creation and > > > > > > unlink are atomically recorded with change attribute updates, then the > > > > > > client has to always assume that the server is lying, and that it has > > > > > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr > > > > > > requests after each and every directory modification in order to check > > > > > > that some other client didn't also sneak in a change of their own. > > > > > > > > > > > > > > > > We generally hold the parent dir's inode->i_rwsem exclusively over most > > > > > important directory changes, and the times/i_version are also updated > > > > > while holding it. What we don't do is serialize reads of this value vs. > > > > > the i_rwsem, so you could see new directory contents alongside an old > > > > > i_version. 
Maybe we should be taking it for read when we query it on a > > > > > directory? > > > > > > > > We do hold i_rwsem today. I'm working on changing that. Preserving > > > > atomic directory changeinfo will be a challenge. The only mechanism I > > > > can think if is to pass a "u64*" to all the directory modification ops, > > > > and they fill in the version number at the point where it is incremented > > > > (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that > > > > "before" was one less than "after". If you don't want to internally > > > > require single increments, then you would need to pass a 'u64 [2]' to > > > > get two iversions back. > > > > > > > > > > That's a major redesign of what the i_version counter is today. It may > > > very well end up being needed, but that's going to touch a lot of stuff > > > in the VFS. Are you planning to do that as a part of your locking > > > changes? > > > > > > > "A major design"? How? The "one less than" might be, but allowing a > > directory morphing op to fill in a "u64 [2]" is just a new interface to > > existing data. One that allows fine grained atomicity. > > > > This would actually be really good for NFS. nfs_mkdir (for example) > > could easily have access to the atomic pre/post changedid provided by > > the server, and so could easily provide them to nfsd. > > > > I'm not planning to do this as part of my locking changes. In the first > > instance only NFS changes behaviour, and it doesn't provide atomic > > changeids, so there is no loss of functionality. > > > > When some other filesystem wants to opt-in to shared-locking on > > directories - that would be the time to push through a better interface. > > > > I think nfsd does provide atomic changeids for directory operations > currently. AFAICT, any operation where we're changing directory contents > is done while holding the i_rwsem exclusively, and we hold that lock > over the pre and post i_version fetch for the change_info4. > > If you change nfsd to allow parallel directory morphing operations > without addressing this, then I think that would be a regression. Of course. As I said, in the first instance only NFS allows parallel directory morphing ops, and NFS doesn't provide atomic pre/post already. No regression. Parallel directory morphing is opt-in - at least until all file systems can be converted and these other issues are resolved. > > > > > > > > > > > > > Achieving atomicity with file writes though is another matter entirely. > > > > > I'm not sure that's even doable or how to approach it if so. > > > > > Suggestions? > > > > > > > > Call inode_maybe_inc_version(page->host) in __folio_mark_dirty() ?? > > > > > > > > > > Writes can cover multiple folios so we'd be doing several increments per > > > write. Maybe that's ok? Should we also be updating the ctime at that > > > point as well? > > > > You would only do several increments if something was reading the value > > concurrently, and then you really should to several increments for > > correctness. > > > > Agreed. > > > > > > > Fetching the i_version under the i_rwsem is probably sufficient to fix > > > this though. Most of the write_iter ops already bump the i_version while > > > holding that lock, so this wouldn't add any extra locking to the write > > > codepaths. > > > > Adding new locking doesn't seem like a good idea. It's bound to have > > performance implications. It may well end up serialising the directory > > op that I'm currently trying to make parallelisable. 
> > > > The new locking would only be in the NFSv4 GETATTR codepath: > > https://lore.kernel.org/linux-nfs/20220908172448.208585-9-jlayton@kernel.org/T/#u > > Maybe we'd still be better off taking a hit in the write codepath instead > of doing this, but with this, most of the penalty would be paid by nfsd > which I would think would be preferred here. inode_lock_shared() would be acceptable here. inode_lock() is unnecessary. > > The problem of mmap writes is another matter though. Not sure what we > can do about that without making i_version bumps a lot more expensive. > Agreed. We need to document our way out of that one. NeilBrown > -- > Jeff Layton <jlayton@kernel.org> >
On Sun, 11 Sep 2022, Al Viro wrote: > On Thu, Sep 08, 2022 at 10:40:43AM +1000, NeilBrown wrote: > > > We do hold i_rwsem today. I'm working on changing that. Preserving > > atomic directory changeinfo will be a challenge. The only mechanism I > > can think of is to pass a "u64*" to all the directory modification ops, > > and they fill in the version number at the point where it is incremented > > (inode_maybe_inc_iversion_return()). The (nfsd) caller assumes that > > "before" was one less than "after". If you don't want to internally > > require single increments, then you would need to pass a 'u64 [2]' to > > get two iversions back. > > Are you serious? What kind of boilerplate would that inflict on the > filesystems not, er, opting in for that... scalability improvement > experiment? > Why would you think there would be any boilerplate? Only filesystems that opt in would need to do anything, and only when the caller asked (by passing a non-NULL array pointer). NeilBrown
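As a purely hypothetical illustration of the interface being described (nothing like this exists in the VFS; the struct and function here are invented), an opted-in directory operation would fill a caller-supplied u64[2] at the exact point it bumps the version:

	#include <stdint.h>
	#include <stdio.h>

	struct dir { uint64_t i_version; };

	/* A mock directory op: report the values bracketing its own increment.
	 * Callers that don't care simply pass NULL. */
	static int mock_mkdir(struct dir *parent, const char *name, uint64_t changeid[2])
	{
		(void)name;
		if (changeid)
			changeid[0] = parent->i_version;   /* "before" */
		parent->i_version++;                       /* the op's own bump */
		if (changeid)
			changeid[1] = parent->i_version;   /* "after" */
		return 0;
	}

	int main(void)
	{
		struct dir d = { .i_version = 41 };
		uint64_t cid[2];

		mock_mkdir(&d, "sub", cid);
		printf("before=%llu after=%llu\n",
		       (unsigned long long)cid[0], (unsigned long long)cid[1]);
		return 0;
	}

Only the filesystem (here the mock op) knows the precise moment of the increment, which is what would let a caller like nfsd obtain an atomic before/after pair without taking extra locks itself.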
On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > On Sat, 10 Sep 2022, Jeff Layton wrote: > > On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote: > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote: > > > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote: > > > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote: > > > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases > > > > > > > > > > > > > > below, > > > > > > > > > > > > > > the > > > > > > > > > > > > > > application reads 'state B' as having occurred if any data was > > > > > > > > > > > > > > committed to disk before the crash. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Application Filesystem > > > > > > > > > > > > > > =========== ========= > > > > > > > > > > > > > > read change attr <- 'state A' > > > > > > > > > > > > > > read data <- 'state A' > > > > > > > > > > > > > > write data -> 'state B' > > > > > > > > > > > > > > <crash>+<reboot> > > > > > > > > > > > > > > read change attr <- 'state B' > > > > > > > > > > > > > > > > > > > > > > > > The important thing here is to not see 'state A'. Seeing 'state > > > > > > > > > > > > C' > > > > > > > > > > > > should be acceptable. Worst case we could merge in wall-clock > > > > > > > > > > > > time > > > > > > > > > > > > of > > > > > > > > > > > > system boot, but the filesystem should be able to be more helpful > > > > > > > > > > > > than > > > > > > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, without the crash+reboot it would still be acceptable to > > > > > > > > > > see > > > > > > > > > > "state A" at the end there - but preferably not for long. > > > > > > > > > > From the NFS perspective, the changeid needs to update by the time > > > > > > > > > > of > > > > > > > > > > a > > > > > > > > > > close or unlock (so it is visible to open or lock), but before that > > > > > > > > > > it > > > > > > > > > > is just best-effort. > > > > > > > > > > > > > > > > Nope. That will inevitably lead to data corruption, since the > > > > > > > > application might decide to use the data from state A instead of > > > > > > > > revalidating it. > > > > > > > > > > > > > > > > > > > > The point is, NFS is not the only potential use case for change > > > > > > attributes. We wouldn't be bothering to discuss statx() if it was. > > > > > > > > My understanding is that it was primarily a desire to add fstests to > > > > exercise the i_version which motivated the statx extension. > > > > Obviously we should prepare for other uses though. > > > > > > > > Mainly. Also, userland nfs servers might also like this for obvious > > reasons. For now though, in the v5 set, I've backed off on trying to > > expose this to userland in favor of trying to just clean up the internal > > implementation. > > > > I'd still like to expose this via statx if possible, but I don't want to > > get too bogged down in interface design just now as we have Real Bugs to > > fix. That patchset should make it simple to expose it later though. > > > > > > > > > > > > > > I could be using O_DIRECT, and all the tricks in order to ensure > > > > > > that > > > > > > my stock broker application (to choose one example) has access > > > > > > to the > > > > > > absolute very latest prices when I'm trying to execute a trade. 
> > > > > > When the filesystem then says 'the prices haven't changed since > > > > > > your > > > > > > last read because the change attribute on the database file is > > > > > > the > > > > > > same' in response to a statx() request with the > > > > > > AT_STATX_FORCE_SYNC > > > > > > flag set, then why shouldn't my application be able to assume it > > > > > > can > > > > > > serve those prices right out of memory instead of having to go > > > > > > to disk? > > > > > > > > I would think that such an application would be using inotify rather > > > > than having to poll. But certainly we should have a clear statement > > > > of > > > > quality-of-service parameters in the documentation. > > > > If we agree that perfect atomicity is what we want to promise, and > > > > that > > > > the cost to the filesystem and the statx call is acceptable, then so > > > > be it. > > > > > > > > My point wasn't to say that atomicity is bad. It was that: > > > > - if the i_version change is visible before the change itself is > > > > visible, then that is a correctness problem. > > > > - if the i_version change is only visible some time after the > > > > change > > > > itself is visible, then that is a quality-of-service issue. > > > > I cannot see any room for debating the first. I do see some room to > > > > debate the second. > > > > > > > > Cached writes, directory ops, and attribute changes are, I think, > > > > easy > > > > enough to provide truly atomic i_version updates with the change > > > > being > > > > visible. > > > > > > > > Changes to a shared memory-mapped files is probably the hardest to > > > > provide timely i_version updates for. We might want to document an > > > > explicit exception for those. Alternately each request for > > > > i_version > > > > would need to find all pages that are writable, remap them read-only > > > > to > > > > catch future writes, then update i_version if any were writable > > > > (i.e. > > > > ->mkwrite had been called). That is the only way I can think of to > > > > provide atomicity. > > > > > > > > I don't think we really want to make i_version bumps that expensive. > > Documenting that you can't expect perfect consistency vs. mmap with NFS > > seems like the best thing to do. We do our best, but that sort of > > synchronization requires real locking. > > > > > > O_DIRECT writes are a little easier than mmapped files. I suspect we > > > > should update the i_version once the device reports that the write is > > > > complete, but a parallel reader could have seem some of the write before > > > > that moment. True atomicity could only be provided by taking some > > > > exclusive lock that blocked all O_DIRECT writes. Jeff seems to be > > > > suggesting this, but I doubt the stock broker application would be > > > > willing to make the call in that case. I don't think I would either. > > > > Well, only blocked for long enough to run the getattr. Granted, with a > > slow underlying filesystem that can take a while. > > Maybe I misunderstand, but this doesn't seem to make much sense. > > If you want i_version updates to appear to be atomic w.r.t O_DIRECT > writes, then you need to prevent accessing the i_version while any write > is on-going. At that time there is no meaningful value for i_version. > So you need a lock (At least shared) around the actual write, and you > need an exclusive lock around the get_i_version(). 
> So accessing the i_version would have to wait for all pending O_DIRECT > writes to complete, and would block any new O_DIRECT writes from > starting. > > This could be expensive. > > There is not currently any locking around O_DIRECT writes. You cannot > synchronise with them. > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold inode_lock_shared across the I/O. That was why patch #8 takes the inode_lock (exclusive) across the getattr. > The best you can do is update the i_version immediately after all the > O_DIRECT writes in a single request complete. > > > > > To summarize, there are two main uses for the change attr in NFSv4: > > > > 1/ to provide change_info4 for directory morphing operations (CREATE, > > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already > > atomic in the current nfsd code (AFAICT) by virtue of the fact that we > > hold the i_rwsem exclusively over these operations. The change attr is > > also queried pre and post while the lock is held, so that should ensure > > that we get true atomicity for this. > > Yes, directory ops are relatively easy. > > > > > 2/ as an adjunct for the ctime when fetching attributes to validate > > caches. We don't expect perfect consistency between read (and readlike) > > operations and GETATTR, even when they're in the same compound. > > > > IOW, a READ+GETATTR compound can legally give you a short (or zero- > > length) read, and then the getattr indicates a size that is larger than > > where the READ data stops, due to a write or truncate racing in after > > the read. > > I agree that atomicity is neither necessary nor practical. Ordering is > important though. I don't think a truncate(0) racing with a READ can > credibly result in a non-zero size AFTER a zero-length read. A truncate > that extends the size could have that effect though. > > > > > Ideally, the attributes in the GETATTR reply should be consistent > > between themselves though. IOW, all of the attrs should accurately > > represent the state of the file at a single point in time. > > change+size+times+etc. should all be consistent with one another. > > > > I think we get all of this by taking the inode_lock around the > > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant > > solution, but it should give us the atomicity we need, and it doesn't > > require adding extra operations or locking to the write codepaths. > > Explicit attribute changes (chown/chmod/utimes/truncate etc) are always > done under the inode lock. Implicit changes via inode_update_time() are > not (though xfs does take the lock, ext4 doesn't, haven't checked > others). So taking the inode lock won't ensure those are internally > consistent. > > I think using inode_lock_shared() is acceptable. It doesn't promise > perfect atomicity, but it is probably good enough. > > We'd need a good reason to want perfect atomicity to go further, and I > cannot think of one. > > Taking inode_lock_shared is sufficient to block out buffered and DAX writes. DIO writes sometimes only take the shared lock (e.g. when the data is already properly aligned). If we want to ensure the getattr doesn't run while _any_ writes are running, we'd need the exclusive lock. 
Maybe that's overkill, though it seems like we could have a race like this without taking inode_lock across the getattr:

	reader				writer
	-----------------------------------------------------------------
					i_version++
	getattr
	read
					DIO write to backing store

Given that we can't fully exclude mmap writes, maybe we can just document that mixing DIO or mmap writes on the server + NFS may not be fully cache coherent. > > > > > We could also consider less invasive ways to achieve this (maybe some > > sort of seqretry loop around the vfs_getattr call?), but I'd rather not > > do extra work in the write codepaths if we can get away with it. > > -- > > Jeff Layton <jlayton@kernel.org> > > > >
On Sun, 2022-09-11 at 08:13 +1000, NeilBrown wrote: > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > The machine crashes and comes back up, and we get a query for i_version > > and it comes back as X. Fine, it's an old version. Now there is a write. > > What do we do to ensure that the new value doesn't collide with X+1? > > (I missed this bit in my earlier reply..) > > How is it "Fine" to see an old version? > The file could have changed without the version changing. > And I thought one of the goals of the crash-count was to be able to > provide a monotonic change id. > "Fine" in the sense that we expect that to happen in this situation. It's not fine for the clients obviously, which is why we're discussing mitigation techniques.
On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote: > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: > > On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote: > > > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote: > > > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: > > > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: > > > > > > Yeah, ok. That does make some sense. So we would mix this into the > > > > > > i_version instead of the ctime when it was available. Preferably, we'd > > > > > > mix that in when we store the i_version rather than adding it afterward. > > > > > > > > > > > > Ted, how would we access this? Maybe we could just add a new (generic) > > > > > > super_block field for this that ext4 (and other filesystems) could > > > > > > populate at mount time? > > > > > > > > > > Couldn't the filesystem just return an ino_version that already includes > > > > > it? > > > > > > > > > > > > > Yes. That's simple if we want to just fold it in during getattr. If we > > > > want to fold that into the values stored on disk, then I'm a little less > > > > clear on how that will work. > > > > > > > > Maybe I need a concrete example of how that will work: > > > > > > > > Suppose we have an i_version value X with the previous crash counter > > > > already factored in that makes it to disk. We hand out a newer version > > > > X+1 to a client, but that value never makes it to disk. > > > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > > > I was assuming we could partition i_version's 64 bits somehow: e.g., top > > > 16 bits store the crash counter. You increment the i_version by: 1) > > > replacing the top bits by the new crash counter, if it has changed, and > > > 2) incrementing. > > > > > > Do the numbers work out? 2^16 mounts after unclean shutdowns sounds > > > like a lot for one filesystem, as does 2^48 changes to a single file, > > > but people do weird things. Maybe there's a better partitioning, or > > > some more flexible way of maintaining an i_version that still allows you > > > to identify whether a given i_version preceded a crash. > > > > > > > We consume one bit to keep track of the "seen" flag, so it would be a > > 16+47 split. I assume that we'd also reset the version counter to 0 when > > the crash counter changes? Maybe that doesn't matter as long as we don't > > overflow into the crash counter. > > > > I'm not sure we can get away with 16 bits for the crash counter, as > > it'll leave us subject to the version counter wrapping after a long > > uptimes. > > > > If you increment a counter every nanosecond, how long until that counter > > wraps? With 63 bits, that's 292 years (and change). With 16+47 bits, > > that's less than two days. An 8+55 split would give us ~416 days which > > seems a bit more reasonable? > > Though now it's starting to seem a little limiting to allow only 2^8 > mounts after unclean shutdowns. > > Another way to think of it might be: multiply that 8-bit crash counter > by 2^48, and think of it as a 64-bit value that we believe (based on > practical limits on how many times you can modify a single file) is > gauranteed to be larger than any i_version that we gave out before the > most recent crash. 
> > Our goal is to ensure that after a crash, any *new* i_versions that we > give out or write to disk are larger than any that have previously been > given out. We can do that by ensuring that they're equal to at least > that old maximum. > > So think of the 64-bit value we're storing in the superblock as a > ceiling on i_version values across all the filesystem's inodes. Call it > s_version_max or something. We also need to know what the maximum was > before the most recent crash. Call that s_version_max_old. > > Then we could get correct behavior if we generated i_versions with > something like: > > i_version++; > if (i_version < s_version_max_old) > i_version = s_version_max_old; > if (i_version > s_version_max) > s_version_max = i_version + 1; > > But that last step makes this ludicrously expensive, because for this to > be safe across crashes we need to update that value on disk as well, and > we need to do that frequently. > > Fortunately, s_version_max doesn't have to be a tight bound at all. We > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at > a time. And recognize when we're running up against it way ahead of > time, so we only need to say "here's an updated value, could you please > make sure it gets to disk sometime in the next twenty minutes"? > (Numbers made up.) > > Sorry, that was way too many words. But I think something like that > could work, and make it very difficult to hit any hard limits, and > actually not be too complicated?? Unless I missed something. > That's not too many words -- I appreciate a good "for dummies" explanation! A scheme like that could work. It might be hard to do it without a spinlock or something, but maybe that's ok. Thinking more about how we'd implement this in the underlying filesystems: To do this we'd need 2 64-bit fields in the on-disk and in-memory superblocks for ext4, xfs and btrfs. On the first mount after a crash, the filesystem would need to bump s_version_max by the significant increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need to do that. Would there be a way to ensure that the new s_version_max value has made it to disk? Bumping it by a large value and hoping for the best might be ok for most cases, but there are always outliers, so it might be worthwhile to make an i_version increment wait on that if necessary.
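A minimal standalone sketch of the superblock-level bookkeeping described here. The struct, field and function names are hypothetical (not taken from ext4/xfs/btrfs), and the 2^40 chunk is just the made-up number from the discussion.

	#include <stdbool.h>
	#include <stdint.h>

	#define VERSION_MAX_CHUNK (UINT64_C(1) << 40)

	struct example_sb {
		uint64_t s_version_max;      /* ceiling on i_version values, persisted on disk */
		uint64_t s_version_max_old;  /* ceiling that was in force before this mount */
		bool     was_unclean;        /* the "cleanly unmounted" flag was not set */
	};

	/* Run once at mount time, after reading the on-disk superblock. */
	static void version_ceiling_init(struct example_sb *sb)
	{
		sb->s_version_max_old = sb->s_version_max;
		if (sb->was_unclean) {
			/* Anything up to the old ceiling may have been handed out
			 * but lost, so jump well past it.  The new ceiling has to
			 * be durable on disk before i_versions based on it are
			 * given out. */
			sb->s_version_max += VERSION_MAX_CHUNK;
			/* ...write the superblock back and wait for it here... */
		}
	}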
* Jeff Layton: > To do this we'd need 2 64-bit fields in the on-disk and in-memory > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > the filesystem would need to bump s_version_max by the significant > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > to do that. > > Would there be a way to ensure that the new s_version_max value has made > it to disk? Bumping it by a large value and hoping for the best might be > ok for most cases, but there are always outliers, so it might be > worthwhile to make an i_version increment wait on that if necessary. How common are unclean shutdowns in practice? Do ext4/XFS/btrfs keep counters in the superblocks for journal replays that can be read easily? Several useful i_version applications could be negatively impacted by frequent i_version invalidation. Thanks, Florian
On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote: > A scheme like that could work. It might be hard to do it without a > spinlock or something, but maybe that's ok. Thinking more about how we'd > implement this in the underlying filesystems: > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > the filesystem would need to bump s_version_max by the significant > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > to do that. > > Would there be a way to ensure that the new s_version_max value has made > it to disk? Bumping it by a large value and hoping for the best might be > ok for most cases, but there are always outliers, so it might be > worthwhile to make an i_version increment wait on that if necessary. I was imagining that when you recognize you're getting close, you kick off something which writes s_version_max+2^40 to disk, and then updates s_version_max to that new value on success of the write. The code that increments i_version checks to make sure it wouldn't exceed s_version_max. If it would, something has gone wrong--a write has failed or taken a long time--so it waits or errors out or something, depending on desired filesystem behavior in that case. No locking required in the normal case? --b.
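Continuing the hypothetical example_sb sketch from a little earlier, the increment side described here might look roughly like this; the low-water margin is made up.

	#define VERSION_MAX_LOW_WATER (UINT64_C(1) << 20)   /* "getting close" margin */

	static uint64_t version_inc(struct example_sb *sb, uint64_t i_version)
	{
		i_version++;
		if (i_version < sb->s_version_max_old)   /* catch up after a crash */
			i_version = sb->s_version_max_old;

		if (sb->s_version_max - i_version < VERSION_MAX_LOW_WATER) {
			/* Getting close: kick off a write of
			 * s_version_max + VERSION_MAX_CHUNK now, and raise the
			 * in-memory ceiling only once that write is on disk. */
		}
		if (i_version >= sb->s_version_max) {
			/* Should not happen unless that write failed or stalled:
			 * wait, warn, or error out, per filesystem policy. */
		}
		return i_version;
	}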
On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote: > * Jeff Layton: > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > > the filesystem would need to bump s_version_max by the significant > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > > to do that. > > > > Would there be a way to ensure that the new s_version_max value has made > > it to disk? Bumping it by a large value and hoping for the best might be > > ok for most cases, but there are always outliers, so it might be > > worthwhile to make an i_version increment wait on that if necessary. > > How common are unclean shutdowns in practice? Do ext4/XFS/btrfs keep > counters in the superblocks for journal replays that can be read easily? > > Several useful i_version applications could be negatively impacted by > frequent i_version invalidation. > One would hope "not very often", but Oopses _are_ something that happens occasionally, even in very stable environments, and it would be best if what we're building can cope with them. Consider:

	reader				writer
	----------------------------------------------------------
					start with i_version 1
					inode updated in memory, i_version++
	query, get i_version 2
	  <<< CRASH : update never makes it to disk, back at 1 after reboot >>>
	query, get i_version 1
					application restarts and redoes write,
					i_version at 2^40+1
	query, get i_version 2^40+1

The main thing we have to avoid here is giving out an i_version that represents two different states of the same inode. This should achieve that. Something else we should consider though is that with enough crashes on a long-lived filesystem, the value could eventually wrap. I think we should acknowledge that fact in advance, and plan to deal with it (particularly if we're going to expose this to userland eventually). Because of the "seen" flag, we have a 63 bit counter to play with. Could we use a similar scheme to the one we use to handle when "jiffies" wraps? Assume that we'd never compare two values that were more than 2^62 apart? We could add i_version_before/i_version_after macros to make it simple to handle this.
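A sketch of what such helpers could look like, modelled on the kernel's time_before()/time_after() pattern for jiffies; the names are the ones suggested above, but no such macros exist today.

	#include <stdint.h>

	/* True if a is a later change attribute than b, even if the 64-bit
	 * counter wrapped in between -- only meaningful while the two samples
	 * are no more than 2^62 apart, per the assumption above. */
	#define i_version_after(a, b)   ((int64_t)((uint64_t)(b) - (uint64_t)(a)) < 0)
	#define i_version_before(a, b)  i_version_after((b), (a))

With this, i_version_after(remote, cached) keeps giving the right answer across a wrap from near UINT64_MAX back to small values.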
On Mon, 2022-09-12 at 08:54 -0400, J. Bruce Fields wrote: > On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote: > > A scheme like that could work. It might be hard to do it without a > > spinlock or something, but maybe that's ok. Thinking more about how we'd > > implement this in the underlying filesystems: > > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > > the filesystem would need to bump s_version_max by the significant > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > > to do that. > > > > Would there be a way to ensure that the new s_version_max value has made > > it to disk? Bumping it by a large value and hoping for the best might be > > ok for most cases, but there are always outliers, so it might be > > worthwhile to make an i_version increment wait on that if necessary. > > I was imagining that when you recognize you're getting close, you kick > off something which writes s_version_max+2^40 to disk, and then updates > s_version_max to that new value on success of the write. > Ok, that makes sense. > The code that increments i_version checks to make sure it wouldn't > exceed s_version_max. If it would, something has gone wrong--a write > has failed or taken a long time--so it waits or errors out or something, > depending on desired filesystem behavior in that case. > Maybe could just throw a big scary pr_warn too? I'd have to think about how we'd want to handle this case. > No locking required in the normal case? Yeah, maybe not.
* Jeff Layton: > On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote: >> * Jeff Layton: >> >> > To do this we'd need 2 64-bit fields in the on-disk and in-memory >> > superblocks for ext4, xfs and btrfs. On the first mount after a crash, >> > the filesystem would need to bump s_version_max by the significant >> > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need >> > to do that. >> > >> > Would there be a way to ensure that the new s_version_max value has made >> > it to disk? Bumping it by a large value and hoping for the best might be >> > ok for most cases, but there are always outliers, so it might be >> > worthwhile to make an i_version increment wait on that if necessary. >> >> How common are unclean shutdowns in practice? Do ext4/XFS/btrfs keep >> counters in the superblocks for journal replays that can be read easily? >> >> Several useful i_version applications could be negatively impacted by >> frequent i_version invalidation. >> > > One would hope "not very often", but Oopses _are_ something that happens > occasionally, even in very stable environments, and it would be best if > what we're building can cope with them. I was wondering if such unclean shutdown events are associated with SSD “unsafe shutdowns”, as identified by the SMART counter. I think those aren't necessarily restricted to oopses or various forms of power loss (maybe depending on file system/devicemapper configuration)? I admit it's possible that the file system is shut down cleanly before the kernel requests the power-off state from the firmware, but the underlying SSD is not. Thanks, Florian
On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > The machine crashes and comes back up, and we get a query for i_version > > and it comes back as X. Fine, it's an old version. Now there is a write. > > What do we do to ensure that the new value doesn't collide with X+1? > > (I missed this bit in my earlier reply..) > > How is it "Fine" to see an old version? > The file could have changed without the version changing. > And I thought one of the goals of the crash-count was to be able to > provide a monotonic change id. I was still mainly thinking about how to provide reliable close-to-open semantics between NFS clients. In the case the writer was an NFS client, it wasn't done writing (or it would have COMMITted), so those writes will come in and bump the change attribute soon, and as long as we avoid the small chance of reusing an old change attribute, we're OK, and I think it'd even still be OK to advertise CHANGE_TYPE_IS_MONOTONIC_INCR. If we're trying to do better than that, I'm just not sure what's right. --b.
On Mon, 2022-09-12 at 15:20 +0200, Florian Weimer wrote: > * Jeff Layton: > > > On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote: > > > * Jeff Layton: > > > > > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > > > > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > > > > the filesystem would need to bump s_version_max by the significant > > > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > > > > to do that. > > > > > > > > Would there be a way to ensure that the new s_version_max value has made > > > > it to disk? Bumping it by a large value and hoping for the best might be > > > > ok for most cases, but there are always outliers, so it might be > > > > worthwhile to make an i_version increment wait on that if necessary. > > > > > > How common are unclean shutdowns in practice? Do ex64/XFS/btrfs keep > > > counters in the superblocks for journal replays that can be read easily? > > > > > > Several useful i_version applications could be negatively impacted by > > > frequent i_version invalidation. > > > > > > > One would hope "not very often", but Oopses _are_ something that happens > > occasionally, even in very stable environments, and it would be best if > > what we're building can cope with them. > > I was wondering if such unclean shutdown events are associated with SSD > “unsafe shutdowns”, as identified by the SMART counter. I think those > aren't necessarily restricted to oopses or various forms of powerless > (maybe depending on file system/devicemapper configuration)? > > I admit it's possible that the file system is shut down cleanly before > the kernel requests the power-off state from the firmware, but the > underlying SSD is not. > Yeah filesystem integrity is mostly what we're concerned with here. I think most local filesystems effectively set a flag in the superblock that is cleared when the it's cleanly unmounted. If that flag is set when you go to mount then you know there was a crash. We'd probably key off of that in some way internally.
On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > Because of the "seen" flag, we have a 63 bit counter to play with. Could > we use a similar scheme to the one we use to handle when "jiffies" > wraps? Assume that we'd never compare two values that were more than > 2^62 apart? We could add i_version_before/i_version_after macros to make > it simple to handle this. As far as I recall the protocol just assumes it can never wrap. I guess you could add a new change_attr_type that works the way you describe. But without some new protocol clients aren't going to know what to do with a change attribute that wraps. I think this just needs to be designed so that wrapping is impossible in any realistic scenario. I feel like that's doable? If we feel we have to catch that case, the only 100% correct behavior would probably be to make the filesystem readonly. --b.
On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > Because of the "seen" flag, we have a 63 bit counter to play with. Could > > we use a similar scheme to the one we use to handle when "jiffies" > > wraps? Assume that we'd never compare two values that were more than > > 2^62 apart? We could add i_version_before/i_version_after macros to make > > it simple to handle this. > > As far as I recall the protocol just assumes it can never wrap. I guess > you could add a new change_attr_type that works the way you describe. > But without some new protocol clients aren't going to know what to do > with a change attribute that wraps. > Right, I think that's the case now, and with contemporary hardware that shouldn't ever happen, but in 10 years when we're looking at femtosecond latencies, could this be different? I don't know. > I think this just needs to be designed so that wrapping is impossible in > any realistic scenario. I feel like that's doable? > > If we feel we have to catch that case, the only 100% correct behavior > would probably be to make the filesystem readonly. What would be the recourse at that point? Rebuild the fs from scratch, I guess?
On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > Because of the "seen" flag, we have a 63 bit counter to play with. > > Could > > we use a similar scheme to the one we use to handle when "jiffies" > > wraps? Assume that we'd never compare two values that were more > > than > > 2^62 apart? We could add i_version_before/i_version_after macros to > > make > > it simple to handle this. > > As far as I recall the protocol just assumes it can never wrap. I > guess > you could add a new change_attr_type that works the way you describe. > But without some new protocol clients aren't going to know what to do > with a change attribute that wraps. > > I think this just needs to be designed so that wrapping is impossible > in > any realistic scenario. I feel like that's doable? > > If we feel we have to catch that case, the only 100% correct behavior > would probably be to make the filesystem readonly. > Which protocol? If you're talking about basic NFSv4, it doesn't assume anything about the change attribute and wrapping. The NFSv4.2 protocol did introduce the optional attribute 'change_attr_type' that tries to describe the change attribute behaviour to the client. It tells you if the behaviour is monotonically increasing, but doesn't say anything about the behaviour when the attribute value overflows. That said, the Linux NFSv4.2 client, which uses that change_attr_type attribute does deal with overflow by assuming standard uint64_t wrap around rules. i.e. it assumes bit values > 63 are truncated, meaning that the value obtained by incrementing (2^64-1) is 0.
On Mon, Sep 12, 2022 at 10:02:27AM -0400, Jeff Layton wrote: > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > > Because of the "seen" flag, we have a 63 bit counter to play with. Could > > > we use a similar scheme to the one we use to handle when "jiffies" > > > wraps? Assume that we'd never compare two values that were more than > > > 2^62 apart? We could add i_version_before/i_version_after macros to make > > > it simple to handle this. > > > > As far as I recall the protocol just assumes it can never wrap. I guess > > you could add a new change_attr_type that works the way you describe. > > But without some new protocol clients aren't going to know what to do > > with a change attribute that wraps. > > > > Right, I think that's the case now, and with contemporary hardware that > shouldn't ever happen, but in 10 years when we're looking at femtosecond > latencies, could this be different? I don't know. That doesn't sound likely. We probably need not just 2^63 writes to a single file, but a dependent sequence of 2^63 interspersed writes and change attribute reads. Then there's the question of how many crashes and remounts are possible for a single filesystem in the worst case. > > > I think this just needs to be designed so that wrapping is impossible in > > any realistic scenario. I feel like that's doable? > > > > If we feel we have to catch that case, the only 100% correct behavior > > would probably be to make the filesystem readonly. > > What would be the recourse at that point? Rebuild the fs from scratch, I > guess? I guess. --b.
On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote: > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > > Because of the "seen" flag, we have a 63 bit counter to play with. > > > Could > > > we use a similar scheme to the one we use to handle when "jiffies" > > > wraps? Assume that we'd never compare two values that were more > > > than > > > 2^62 apart? We could add i_version_before/i_version_after macros to > > > make > > > it simple to handle this. > > > > As far as I recall the protocol just assumes it can never wrap. I > > guess > > you could add a new change_attr_type that works the way you describe. > > But without some new protocol clients aren't going to know what to do > > with a change attribute that wraps. > > > > I think this just needs to be designed so that wrapping is impossible > > in > > any realistic scenario. I feel like that's doable? > > > > If we feel we have to catch that case, the only 100% correct behavior > > would probably be to make the filesystem readonly. > > > > Which protocol? If you're talking about basic NFSv4, it doesn't assume > anything about the change attribute and wrapping. > > The NFSv4.2 protocol did introduce the optional attribute > 'change_attr_type' that tries to describe the change attribute > behaviour to the client. It tells you if the behaviour is monotonically > increasing, but doesn't say anything about the behaviour when the > attribute value overflows. > > That said, the Linux NFSv4.2 client, which uses that change_attr_type > attribute does deal with overflow by assuming standard uint64_t wrap > around rules. i.e. it assumes bit values > 63 are truncated, meaning > that the value obtained by incrementing (2^64-1) is 0. Yeah, it was the MONOTONIC_INCRE case I was thinking of. That's interesting, I didn't know the client did that. --b.
On Mon, 2022-09-12 at 10:50 -0400, J. Bruce Fields wrote: > On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote: > > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > > > Because of the "seen" flag, we have a 63 bit counter to play > > > > with. > > > > Could > > > > we use a similar scheme to the one we use to handle when > > > > "jiffies" > > > > wraps? Assume that we'd never compare two values that were more > > > > than > > > > 2^62 apart? We could add i_version_before/i_version_after > > > > macros to > > > > make > > > > it simple to handle this. > > > > > > As far as I recall the protocol just assumes it can never wrap. > > > I > > > guess > > > you could add a new change_attr_type that works the way you > > > describe. > > > But without some new protocol clients aren't going to know what > > > to do > > > with a change attribute that wraps. > > > > > > I think this just needs to be designed so that wrapping is > > > impossible > > > in > > > any realistic scenario. I feel like that's doable? > > > > > > If we feel we have to catch that case, the only 100% correct > > > behavior > > > would probably be to make the filesystem readonly. > > > > > > > Which protocol? If you're talking about basic NFSv4, it doesn't > > assume > > anything about the change attribute and wrapping. > > > > The NFSv4.2 protocol did introduce the optional attribute > > 'change_attr_type' that tries to describe the change attribute > > behaviour to the client. It tells you if the behaviour is > > monotonically > > increasing, but doesn't say anything about the behaviour when the > > attribute value overflows. > > > > That said, the Linux NFSv4.2 client, which uses that > > change_attr_type > > attribute does deal with overflow by assuming standard uint64_t > > wrap > > around rules. i.e. it assumes bit values > 63 are truncated, > > meaning > > that the value obtained by incrementing (2^64-1) is 0. > > Yeah, it was the MONOTONIC_INCRE case I was thinking of. That's > interesting, I didn't know the client did that. > If you look at where we compare version numbers, it is always some variant of the following: static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr *fattr, const struct inode *inode) { s64 diff = fattr->change_attr - inode_peek_iversion_raw(inode); if (diff > 0) return 1; return diff == 0 ? 0 : -1; } i.e. we do an unsigned 64-bit subtraction, and then cast it to the signed 64-bit equivalent in order to figure out which is the more recent value.
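A small userspace demonstration of that comparison (the same unsigned subtraction reinterpreted as a signed 64-bit value), showing both the benign wrap-around case and the large-jump hazard that comes up next. Purely illustrative; only the arithmetic mirrors the helper quoted above.

	#include <stdint.h>
	#include <stdio.h>

	static int cmp_monotonic(uint64_t remote, uint64_t cached)
	{
		int64_t diff = (int64_t)(remote - cached);

		if (diff > 0)
			return 1;
		return diff == 0 ? 0 : -1;
	}

	int main(void)
	{
		/* Ordinary case: the remote value is newer. */
		printf("%d\n", cmp_monotonic(5, 3));                        /* prints 1 */
		/* Wrap-around: UINT64_MAX -> 1 still compares as newer. */
		printf("%d\n", cmp_monotonic(1, UINT64_MAX));               /* prints 1 */
		/* Jumping the counter by 2^63 flips the sign, so the newer
		 * value is mistaken for an older one. */
		printf("%d\n", cmp_monotonic(3 + (UINT64_C(1) << 63), 3));  /* prints -1 */
		return 0;
	}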
On Mon, 2022-09-12 at 14:56 +0000, Trond Myklebust wrote: > On Mon, 2022-09-12 at 10:50 -0400, J. Bruce Fields wrote: > > On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote: > > > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > > > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > > > > Because of the "seen" flag, we have a 63 bit counter to play > > > > > with. > > > > > Could > > > > > we use a similar scheme to the one we use to handle when > > > > > "jiffies" > > > > > wraps? Assume that we'd never compare two values that were > > > > > more > > > > > than > > > > > 2^62 apart? We could add i_version_before/i_version_after > > > > > macros to > > > > > make > > > > > it simple to handle this. > > > > > > > > As far as I recall the protocol just assumes it can never > > > > wrap. > > > > I > > > > guess > > > > you could add a new change_attr_type that works the way you > > > > describe. > > > > But without some new protocol clients aren't going to know what > > > > to do > > > > with a change attribute that wraps. > > > > > > > > I think this just needs to be designed so that wrapping is > > > > impossible > > > > in > > > > any realistic scenario. I feel like that's doable? > > > > > > > > If we feel we have to catch that case, the only 100% correct > > > > behavior > > > > would probably be to make the filesystem readonly. > > > > > > > > > > Which protocol? If you're talking about basic NFSv4, it doesn't > > > assume > > > anything about the change attribute and wrapping. > > > > > > The NFSv4.2 protocol did introduce the optional attribute > > > 'change_attr_type' that tries to describe the change attribute > > > behaviour to the client. It tells you if the behaviour is > > > monotonically > > > increasing, but doesn't say anything about the behaviour when the > > > attribute value overflows. > > > > > > That said, the Linux NFSv4.2 client, which uses that > > > change_attr_type > > > attribute does deal with overflow by assuming standard uint64_t > > > wrap > > > around rules. i.e. it assumes bit values > 63 are truncated, > > > meaning > > > that the value obtained by incrementing (2^64-1) is 0. > > > > Yeah, it was the MONOTONIC_INCRE case I was thinking of. That's > > interesting, I didn't know the client did that. > > > > If you look at where we compare version numbers, it is always some > variant of the following: > > static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr > *fattr, > const struct inode *inode) > { > s64 diff = fattr->change_attr - > inode_peek_iversion_raw(inode); > if (diff > 0) > return 1; > return diff == 0 ? 0 : -1; > } > > i.e. we do an unsigned 64-bit subtraction, and then cast it to the > signed 64-bit equivalent in order to figure out which is the more > recent value. > ...and by the way, yes this does mean that if you suddenly add a value of 2^63 to the change attribute, then you are likely to cause the client to think that you just handed it an old value. i.e. you're better off having the crash counter increment the change attribute by a relatively small value. One that is guaranteed to be larger than the values that may have been lost, but that is not excessively large.
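As a small userspace illustration of that last point (purely illustrative, assuming the usual two's-complement conversion), a jump of 2^63 or more makes the new value look older under the signed-difference comparison shown above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t old = 1000;
            /* e.g. a crash counter folded into the top bit */
            uint64_t new = old + (1ULL << 63) + 5;
            /* same trick as nfs_inode_attrs_cmp_monotonic() */
            int64_t diff = (int64_t)(new - old);

            /* prints "new looks older" */
            printf("%s\n", diff > 0 ? "new looks newer" : "new looks older");
            return 0;
    }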
On Mon, 2022-09-12 at 15:32 +0000, Trond Myklebust wrote: > On Mon, 2022-09-12 at 14:56 +0000, Trond Myklebust wrote: > > On Mon, 2022-09-12 at 10:50 -0400, J. Bruce Fields wrote: > > > On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote: > > > > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote: > > > > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote: > > > > > > Because of the "seen" flag, we have a 63 bit counter to play > > > > > > with. > > > > > > Could > > > > > > we use a similar scheme to the one we use to handle when > > > > > > "jiffies" > > > > > > wraps? Assume that we'd never compare two values that were > > > > > > more > > > > > > than > > > > > > 2^62 apart? We could add i_version_before/i_version_after > > > > > > macros to > > > > > > make > > > > > > it simple to handle this. > > > > > > > > > > As far as I recall the protocol just assumes it can never > > > > > wrap. > > > > > I > > > > > guess > > > > > you could add a new change_attr_type that works the way you > > > > > describe. > > > > > But without some new protocol clients aren't going to know what > > > > > to do > > > > > with a change attribute that wraps. > > > > > > > > > > I think this just needs to be designed so that wrapping is > > > > > impossible > > > > > in > > > > > any realistic scenario. I feel like that's doable? > > > > > > > > > > If we feel we have to catch that case, the only 100% correct > > > > > behavior > > > > > would probably be to make the filesystem readonly. > > > > > > > > > > > > > Which protocol? If you're talking about basic NFSv4, it doesn't > > > > assume > > > > anything about the change attribute and wrapping. > > > > > > > > The NFSv4.2 protocol did introduce the optional attribute > > > > 'change_attr_type' that tries to describe the change attribute > > > > behaviour to the client. It tells you if the behaviour is > > > > monotonically > > > > increasing, but doesn't say anything about the behaviour when the > > > > attribute value overflows. > > > > > > > > That said, the Linux NFSv4.2 client, which uses that > > > > change_attr_type > > > > attribute does deal with overflow by assuming standard uint64_t > > > > wrap > > > > around rules. i.e. it assumes bit values > 63 are truncated, > > > > meaning > > > > that the value obtained by incrementing (2^64-1) is 0. > > > > > > Yeah, it was the MONOTONIC_INCRE case I was thinking of. That's > > > interesting, I didn't know the client did that. > > > > > > > If you look at where we compare version numbers, it is always some > > variant of the following: > > > > static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr > > *fattr, > > const struct inode *inode) > > { > > s64 diff = fattr->change_attr - > > inode_peek_iversion_raw(inode); > > if (diff > 0) > > return 1; > > return diff == 0 ? 0 : -1; > > } > > > > i.e. we do an unsigned 64-bit subtraction, and then cast it to the > > signed 64-bit equivalent in order to figure out which is the more > > recent value. > > Good! This seems like the reasonable thing to do, given that the spec doesn't really say that the change attribute has to start at low values. > > ...and by the way, yes this does mean that if you suddenly add a value > of 2^63 to the change attribute, then you are likely to cause the > client to think that you just handed it an old value. > > i.e. you're better off having the crash counter increment the change > attribute by a relatively small value. 
One that is guaranteed to be > larger than the values that may have been lost, but that is not > excessively large. > Yeah. Like with jiffies, you need to make sure the samples you're comparing aren't _too_ far off. That should be doable here -- 62 bits is plenty of room to store a lot of change values. My benchmark (maybe wrong, but maybe good enough) is to figure on an increment per microsecond for a worst-case scenario. With that, 2^40 microseconds is >12 days. Maybe that's overkill. 2^32 microseconds is about an hour and 20 mins. That's probably a reasonable value to use. If we can't get a new value onto disk in that time then something is probably very wrong.
On Mon, 12 Sep 2022, J. Bruce Fields wrote: > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > (I missed this bit in my earlier reply..) > > > > How is it "Fine" to see an old version? > > The file could have changed without the version changing. > > And I thought one of the goals of the crash-count was to be able to > > provide a monotonic change id. > > I was still mainly thinking about how to provide reliable close-to-open > semantics between NFS clients. In the case the writer was an NFS > client, it wasn't done writing (or it would have COMMITted), so those > writes will come in and bump the change attribute soon, and as long as > we avoid the small chance of reusing an old change attribute, we're OK, > and I think it'd even still be OK to advertise > CHANGE_TYPE_IS_MONOTONIC_INCR. You seem to be assuming that the client doesn't crash at the same time as the server (maybe they are both VMs on a host that lost power...) If client A reads and caches, client B writes, the server crashes after writing some data (to already allocated space so no inode update needed) but before writing the new i_version, then client B crashes. When server comes back the i_version will be unchanged but the data has changed. Client A will cache old data indefinitely... > > If we're trying to do better than that, I'm just not sure what's right. I think we need to require the filesystem to ensure that the i_version is seen to increase shortly after any change becomes visible in the file, and no later than the moment when the request that initiated the change is acknowledged as being complete. In the case of an unclean restart, any file that is not known to have been unchanged immediately before the crash must have i_version increased. The simplest implementation is to have an unclean-restart counter and to always included this multiplied by some constant X in the reported i_version. The filesystem guarantees to record (e.g. to journal at least) the i_version if it comes close to X more than the previous record. The filesystem gets to choose X. A more complex solution would be to record (similar to the way orphans are recorded) any file which is open for write, and to add X to the i_version for any "dirty" file still recorded during an unclean restart. This would avoid bumping the i_version for read-only files. There may be other solutions, but we should leave that up to the filesystem. Each filesystem might choose something different. NeilBrown
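A rough sketch of the simpler scheme described above: the reported value is the on-disk counter plus the unclean-restart count multiplied by a filesystem-chosen constant X. Every name here is hypothetical, and the filesystem would still have to promise to persist the raw counter before it drifts within X of the last value it recorded:

    #include <linux/fs.h>
    #include <linux/iversion.h>

    /* "X", chosen by the filesystem */
    #define I_VERSION_RESTART_STRIDE	(1ULL << 40)

    static u64 reported_ino_version(const struct inode *inode,
                                    u64 unclean_restarts)
    {
            return inode_peek_iversion(inode) +
                   unclean_restarts * I_VERSION_RESTART_STRIDE;
    }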
On Mon, 12 Sep 2022, Jeff Layton wrote: > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > This could be expensive. > > > > There is not currently any locking around O_DIRECT writes. You cannot > > synchronise with them. > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > inode_lock_shared across the I/O. That was why patch #8 takes the > inode_lock (exclusive) across the getattr. Looking at ext4_dio_write_iter() it certain does take inode_lock_shared() before starting the write and in some cases it requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for the write to complete. But not in all cases. So I don't think it always holds the shared lock across all direct IO. > > > The best you can do is update the i_version immediately after all the > > O_DIRECT writes in a single request complete. > > > > > > > > To summarize, there are two main uses for the change attr in NFSv4: > > > > > > 1/ to provide change_info4 for directory morphing operations (CREATE, > > > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already > > > atomic in the current nfsd code (AFAICT) by virtue of the fact that we > > > hold the i_rwsem exclusively over these operations. The change attr is > > > also queried pre and post while the lock is held, so that should ensure > > > that we get true atomicity for this. > > > > Yes, directory ops are relatively easy. > > > > > > > > 2/ as an adjunct for the ctime when fetching attributes to validate > > > caches. We don't expect perfect consistency between read (and readlike) > > > operations and GETATTR, even when they're in the same compound. > > > > > > IOW, a READ+GETATTR compound can legally give you a short (or zero- > > > length) read, and then the getattr indicates a size that is larger than > > > where the READ data stops, due to a write or truncate racing in after > > > the read. > > > > I agree that atomicity is neither necessary nor practical. Ordering is > > important though. I don't think a truncate(0) racing with a READ can > > credibly result in a non-zero size AFTER a zero-length read. A truncate > > that extends the size could have that effect though. > > > > > > > > Ideally, the attributes in the GETATTR reply should be consistent > > > between themselves though. IOW, all of the attrs should accurately > > > represent the state of the file at a single point in time. > > > change+size+times+etc. should all be consistent with one another. > > > > > > I think we get all of this by taking the inode_lock around the > > > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant > > > solution, but it should give us the atomicity we need, and it doesn't > > > require adding extra operations or locking to the write codepaths. > > > > Explicit attribute changes (chown/chmod/utimes/truncate etc) are always > > done under the inode lock. Implicit changes via inode_update_time() are > > not (though xfs does take the lock, ext4 doesn't, haven't checked > > others). So taking the inode lock won't ensure those are internally > > consistent. > > > > I think using inode_lock_shared() is acceptable. It doesn't promise > > perfect atomicity, but it is probably good enough. > > > > We'd need a good reason to want perfect atomicity to go further, and I > > cannot think of one. > > > > > > Taking inode_lock_shared is sufficient to block out buffered and DAX > writes. DIO writes sometimes only take the shared lock (e.g. when the > data is already properly aligned). 
> If we want to ensure the getattr doesn't run while _any_ writes are
> running, we'd need the exclusive lock.

But the exclusive lock is bad for scalability.

> Maybe that's overkill, though it seems like we could have a race like
> this without taking inode_lock across the getattr:
>
>     reader                        writer
>     -----------------------------------------------------------------
>                                   i_version++
>     getattr
>     read
>                                   DIO write to backing store
>

This is why I keep saying that the i_version increment must be after the
write, not before it.

> Given that we can't fully exclude mmap writes, maybe we can just
> document that mixing DIO or mmap writes on the server + NFS may not be
> fully cache coherent.

"fully cache coherent" is really more than anyone needs. The i_version
must be seen to change no earlier than the related change becomes visible,
and no later than the request which initiated that change is acknowledged
as complete.

NeilBrown
>>>>> "Jeff" == Jeff Layton <jlayton@kernel.org> writes: > On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote: >> On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: >> > On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote: >> > > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote: >> > > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote: >> > > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote: >> > > > > > Yeah, ok. That does make some sense. So we would mix this into the >> > > > > > i_version instead of the ctime when it was available. Preferably, we'd >> > > > > > mix that in when we store the i_version rather than adding it afterward. >> > > > > > >> > > > > > Ted, how would we access this? Maybe we could just add a new (generic) >> > > > > > super_block field for this that ext4 (and other filesystems) could >> > > > > > populate at mount time? >> > > > > >> > > > > Couldn't the filesystem just return an ino_version that already includes >> > > > > it? >> > > > > >> > > > >> > > > Yes. That's simple if we want to just fold it in during getattr. If we >> > > > want to fold that into the values stored on disk, then I'm a little less >> > > > clear on how that will work. >> > > > >> > > > Maybe I need a concrete example of how that will work: >> > > > >> > > > Suppose we have an i_version value X with the previous crash counter >> > > > already factored in that makes it to disk. We hand out a newer version >> > > > X+1 to a client, but that value never makes it to disk. >> > > > >> > > > The machine crashes and comes back up, and we get a query for i_version >> > > > and it comes back as X. Fine, it's an old version. Now there is a write. >> > > > What do we do to ensure that the new value doesn't collide with X+1? >> > > >> > > I was assuming we could partition i_version's 64 bits somehow: e.g., top >> > > 16 bits store the crash counter. You increment the i_version by: 1) >> > > replacing the top bits by the new crash counter, if it has changed, and >> > > 2) incrementing. >> > > >> > > Do the numbers work out? 2^16 mounts after unclean shutdowns sounds >> > > like a lot for one filesystem, as does 2^48 changes to a single file, >> > > but people do weird things. Maybe there's a better partitioning, or >> > > some more flexible way of maintaining an i_version that still allows you >> > > to identify whether a given i_version preceded a crash. >> > > >> > >> > We consume one bit to keep track of the "seen" flag, so it would be a >> > 16+47 split. I assume that we'd also reset the version counter to 0 when >> > the crash counter changes? Maybe that doesn't matter as long as we don't >> > overflow into the crash counter. >> > >> > I'm not sure we can get away with 16 bits for the crash counter, as >> > it'll leave us subject to the version counter wrapping after a long >> > uptimes. >> > >> > If you increment a counter every nanosecond, how long until that counter >> > wraps? With 63 bits, that's 292 years (and change). With 16+47 bits, >> > that's less than two days. An 8+55 split would give us ~416 days which >> > seems a bit more reasonable? >> >> Though now it's starting to seem a little limiting to allow only 2^8 >> mounts after unclean shutdowns. 
>> >> Another way to think of it might be: multiply that 8-bit crash counter >> by 2^48, and think of it as a 64-bit value that we believe (based on >> practical limits on how many times you can modify a single file) is >> gauranteed to be larger than any i_version that we gave out before the >> most recent crash. >> >> Our goal is to ensure that after a crash, any *new* i_versions that we >> give out or write to disk are larger than any that have previously been >> given out. We can do that by ensuring that they're equal to at least >> that old maximum. >> >> So think of the 64-bit value we're storing in the superblock as a >> ceiling on i_version values across all the filesystem's inodes. Call it >> s_version_max or something. We also need to know what the maximum was >> before the most recent crash. Call that s_version_max_old. >> >> Then we could get correct behavior if we generated i_versions with >> something like: >> >> i_version++; >> if (i_version < s_version_max_old) >> i_version = s_version_max_old; >> if (i_version > s_version_max) >> s_version_max = i_version + 1; >> >> But that last step makes this ludicrously expensive, because for this to >> be safe across crashes we need to update that value on disk as well, and >> we need to do that frequently. >> >> Fortunately, s_version_max doesn't have to be a tight bound at all. We >> can easily just initialize it to, say, 2^40, and only bump it by 2^40 at >> a time. And recognize when we're running up against it way ahead of >> time, so we only need to say "here's an updated value, could you please >> make sure it gets to disk sometime in the next twenty minutes"? >> (Numbers made up.) >> >> Sorry, that was way too many words. But I think something like that >> could work, and make it very difficult to hit any hard limits, and >> actually not be too complicated?? Unless I missed something. >> > That's not too many words -- I appreciate a good "for dummies" > explanation! > A scheme like that could work. It might be hard to do it without a > spinlock or something, but maybe that's ok. Thinking more about how we'd > implement this in the underlying filesystems: > To do this we'd need 2 64-bit fields in the on-disk and in-memory > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > the filesystem would need to bump s_version_max by the significant > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > to do that. > Would there be a way to ensure that the new s_version_max value has made > it to disk? Bumping it by a large value and hoping for the best might be > ok for most cases, but there are always outliers, so it might be > worthwhile to make an i_version increment wait on that if necessary. Would it be silly to steal the same idea from the DNS folks where they can wrap the 32 bit serial number around by incrementing it by a large amount, pushing the out the change, then incrementing back down to 1 to wrap the counter? I just worry about space limited counters that don't automatically wrap, or allow people to force them to wrap gracefully with out major hassles. But I come at this all from the IT side of things, not the programming/kernel side. John
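The pseudocode quoted above can be fleshed out a little. This is only a sketch of the s_version_max idea under the assumptions in that description (a large stride, a ceiling that is raised well before it is reached); the struct, its fields, and the helper that pushes the new ceiling to disk are all invented for illustration:

    #include <linux/spinlock.h>
    #include <linux/types.h>

    #define VERSION_MAX_STRIDE	(1ULL << 40)

    struct sb_versions {
            spinlock_t lock;
            u64 max;      /* ceiling currently recorded on disk */
            u64 max_old;  /* ceiling recorded before the last crash */
    };

    /* Hypothetical: arrange for sbv->max to reach stable storage soon. */
    void schedule_version_max_update(struct sb_versions *sbv);

    static u64 next_i_version(struct sb_versions *sbv, u64 cur)
    {
            u64 new;

            spin_lock(&sbv->lock);
            new = cur + 1;
            if (new < sbv->max_old)         /* never reuse a pre-crash value */
                    new = sbv->max_old;
            if (new + VERSION_MAX_STRIDE / 2 > sbv->max) {
                    /* getting close: raise the ceiling well ahead of time */
                    sbv->max += VERSION_MAX_STRIDE;
                    schedule_version_max_update(sbv);
            }
            spin_unlock(&sbv->lock);
            return new;
    }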
On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote: > On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote: > > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: > > Our goal is to ensure that after a crash, any *new* i_versions that we > > give out or write to disk are larger than any that have previously been > > given out. We can do that by ensuring that they're equal to at least > > that old maximum. > > > > So think of the 64-bit value we're storing in the superblock as a > > ceiling on i_version values across all the filesystem's inodes. Call it > > s_version_max or something. We also need to know what the maximum was > > before the most recent crash. Call that s_version_max_old. > > > > Then we could get correct behavior if we generated i_versions with > > something like: > > > > i_version++; > > if (i_version < s_version_max_old) > > i_version = s_version_max_old; > > if (i_version > s_version_max) > > s_version_max = i_version + 1; > > > > But that last step makes this ludicrously expensive, because for this to > > be safe across crashes we need to update that value on disk as well, and > > we need to do that frequently. > > > > Fortunately, s_version_max doesn't have to be a tight bound at all. We > > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at > > a time. And recognize when we're running up against it way ahead of > > time, so we only need to say "here's an updated value, could you please > > make sure it gets to disk sometime in the next twenty minutes"? > > (Numbers made up.) > > > > Sorry, that was way too many words. But I think something like that > > could work, and make it very difficult to hit any hard limits, and > > actually not be too complicated?? Unless I missed something. > > > > That's not too many words -- I appreciate a good "for dummies" > explanation! > > A scheme like that could work. It might be hard to do it without a > spinlock or something, but maybe that's ok. Thinking more about how we'd > implement this in the underlying filesystems: > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > the filesystem would need to bump s_version_max by the significant > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > to do that. Why only increment on crash? If the filesystem has been unmounted, then any cached data is -stale- and must be discarded. e.g. unmount, run fsck which cleans up corrupt files but does not modify i_version, then mount. Remote caches are now invalid, but i_version may not have changed, so we still need the clean unmount-mount cycle to invalidate caches. IOWs, what we want is a salted i_version value, with the filesystem providing the unique per-mount salt that gets added to the externally visible i_version values. If that's the case, the salt doesn't need to be restricted to just modifying the upper bits - as long as the salt increments substantially and independently to the on-disk inode i_version then we just don't care what bits of the superblock salt change from mount to mount. For XFS we already have a unique 64 bit salt we could use for every mount - clean or unclean - and guarantee it is larger for every mount. It also gets substantially bumped by fsck, too. It's called a Log Sequence Number and we use them to track and strictly order every modification we write into the log. 
This is exactly what is needed for an i_version salt, and it's already guaranteed to be persistent. > Would there be a way to ensure that the new s_version_max value has made > it to disk? Yes, but that's not really relevant to the definition of the salt: we don't need to design the filesystem implementation of a persistent per-mount salt value. All we need is to define the behaviour of the salt (e.g. must always increase across a umount/mount cycle) and then you can let the filesystem developers worry about how to provide the required salt behaviour and its persistence. In the meantime, you can implement and test the salting by using the system time to seed the superblock salt - that's good enough for proof of concept, and as a fallback for filesystems that cannot provide the required per-mount salt persistence.... > Bumping it by a large value and hoping for the best might be > ok for most cases, but there are always outliers, so it might be > worthwhile to make an i_version increment wait on that if necessary. Nothing should be able to query i_version until the filesystem is fully recovered, mounted and the salt has been set. Hence no application (kernel or userspace) should ever see an unsalted i_version value.... -Dave.
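A minimal sketch of the per-mount salt as described: a value fixed at mount time (an XFS log sequence number, or the system time as a fallback) that is simply added to every reported value. sb_iversion_salt() is hypothetical:

    #include <linux/fs.h>
    #include <linux/iversion.h>

    u64 sb_iversion_salt(const struct super_block *sb);	/* hypothetical */

    static u64 salted_ino_version(const struct inode *inode)
    {
            return sb_iversion_salt(inode->i_sb) +
                   inode_peek_iversion(inode);
    }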
On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote: > On Mon, 12 Sep 2022, Jeff Layton wrote: > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > > This could be expensive. > > > > > > There is not currently any locking around O_DIRECT writes. You cannot > > > synchronise with them. > > > > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > > inode_lock_shared across the I/O. That was why patch #8 takes the > > inode_lock (exclusive) across the getattr. > > Looking at ext4_dio_write_iter() it certain does take > inode_lock_shared() before starting the write and in some cases it > requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for > the write to complete. But not in all cases. > So I don't think it always holds the shared lock across all direct IO. To serialise against dio writes, one must: // Lock the inode exclusively to block new DIO submissions inode_lock(inode); // Wait for all in flight DIO reads and writes to complete inode_dio_wait(inode); This is how truncate, fallocate, etc serialise against AIO+DIO which do not hold the inode lock across the entire IO. These have to serialise aginst DIO reads, too, because we can't have IO in progress over a range of the file that we are about to free.... > > Taking inode_lock_shared is sufficient to block out buffered and DAX > > writes. DIO writes sometimes only take the shared lock (e.g. when the > > data is already properly aligned). If we want to ensure the getattr > > doesn't run while _any_ writes are running, we'd need the exclusive > > lock. > > But the exclusive lock is bad for scalability. Serilisation against DIO is -expensive- and -slow-. It's not a solution for what is supposed to be a fast unlocked read-only operation like statx(). > > Maybe that's overkill, though it seems like we could have a race like > > this without taking inode_lock across the getattr: > > > > reader writer > > ----------------------------------------------------------------- > > i_version++ > > getattr > > read > > DIO write to backing store > > > > This is why I keep saying that the i_version increment must be after the > write, not before it. Sure, but that ignores the reason why we actually need to bump i_version *before* we submit a DIO write. DIO write invalidates the page cache over the range of the write, so any racing read will re-populate the page cache during the DIO write. Hence buffered reads can return before the DIO write has completed, and the contents of the read can contain, none, some or all of the contents of the DIO write. Hence i_version has to be incremented before the DIO write is submitted so that racing getattrs will indicate that the local caches have been invalidated and that data needs to be refetched. But, yes, to actually be safe here, we *also* should be bumping i_version on DIO write on DIO write completion so that racing i_version and data reads that occur *after* the initial i_version bump are invalidated immediately. IOWs, to avoid getattr/read races missing stale data invalidations during DIO writes, we really need to bump i_version both _before and after_ DIO write submission. It's corner cases like this where "i_version should only be bumped when ctime changes" fails completely. i.e. there are concurrent IO situations which can only really be handled correctly by bumping i_version whenever either in-memory and/or on-disk persistent data/ metadata state changes occur..... Cheers, Dave.
On Tue, 13 Sep 2022, Dave Chinner wrote: > On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote: > > On Mon, 12 Sep 2022, Jeff Layton wrote: > > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > > > This could be expensive. > > > > > > > > There is not currently any locking around O_DIRECT writes. You cannot > > > > synchronise with them. > > > > > > > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > > > inode_lock_shared across the I/O. That was why patch #8 takes the > > > inode_lock (exclusive) across the getattr. > > > > Looking at ext4_dio_write_iter() it certain does take > > inode_lock_shared() before starting the write and in some cases it > > requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for > > the write to complete. But not in all cases. > > So I don't think it always holds the shared lock across all direct IO. > > To serialise against dio writes, one must: > > // Lock the inode exclusively to block new DIO submissions > inode_lock(inode); > > // Wait for all in flight DIO reads and writes to complete > inode_dio_wait(inode); > > This is how truncate, fallocate, etc serialise against AIO+DIO which > do not hold the inode lock across the entire IO. These have to > serialise aginst DIO reads, too, because we can't have IO in > progress over a range of the file that we are about to free.... > > > > Taking inode_lock_shared is sufficient to block out buffered and DAX > > > writes. DIO writes sometimes only take the shared lock (e.g. when the > > > data is already properly aligned). If we want to ensure the getattr > > > doesn't run while _any_ writes are running, we'd need the exclusive > > > lock. > > > > But the exclusive lock is bad for scalability. > > Serilisation against DIO is -expensive- and -slow-. It's not a > solution for what is supposed to be a fast unlocked read-only > operation like statx(). > > > > Maybe that's overkill, though it seems like we could have a race like > > > this without taking inode_lock across the getattr: > > > > > > reader writer > > > ----------------------------------------------------------------- > > > i_version++ > > > getattr > > > read > > > DIO write to backing store > > > > > > > This is why I keep saying that the i_version increment must be after the > > write, not before it. > > Sure, but that ignores the reason why we actually need to bump > i_version *before* we submit a DIO write. DIO write invalidates the > page cache over the range of the write, so any racing read will > re-populate the page cache during the DIO write. and DIO reads can also get data at some intermediate state of the write. So what? i_version cannot provide coherent caching. You needs locks and call-backs and such for that. i_version (as used by NFS) only aims for approximate caching. Specifically: 1/ max-age caching. The i_version is polled when the age of the cache reaches some preset value, and the cache is purged/reloaded if needed, and the age reset to 0. 2/ close-to-open caching. There are well-defined events where the i_version must reflect preceding events by that the same client (close and unlock) and well defined events when the client will check the i_version before trusting any cache (open and lock). There is absolutely no need ever to update the i_version *before* making a change. If you really think there is, please provide a sequence of events for two different actors/observes where having the update before the change will provide a useful benefit. 
Thanks, NeilBrown > > Hence buffered reads can return before the DIO write has completed, > and the contents of the read can contain, none, some or all of the > contents of the DIO write. Hence i_version has to be incremented > before the DIO write is submitted so that racing getattrs will > indicate that the local caches have been invalidated and that data > needs to be refetched. > > But, yes, to actually be safe here, we *also* should be bumping > i_version on DIO write on DIO write completion so that racing > i_version and data reads that occur *after* the initial i_version > bump are invalidated immediately. > > IOWs, to avoid getattr/read races missing stale data invalidations > during DIO writes, we really need to bump i_version both _before and > after_ DIO write submission. > > It's corner cases like this where "i_version should only be bumped > when ctime changes" fails completely. i.e. there are concurrent IO > situations which can only really be handled correctly by bumping > i_version whenever either in-memory and/or on-disk persistent data/ > metadata state changes occur..... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Tue, 13 Sep 2022, Dave Chinner wrote: > On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote: > > On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote: > > > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: > > > Our goal is to ensure that after a crash, any *new* i_versions that we > > > give out or write to disk are larger than any that have previously been > > > given out. We can do that by ensuring that they're equal to at least > > > that old maximum. > > > > > > So think of the 64-bit value we're storing in the superblock as a > > > ceiling on i_version values across all the filesystem's inodes. Call it > > > s_version_max or something. We also need to know what the maximum was > > > before the most recent crash. Call that s_version_max_old. > > > > > > Then we could get correct behavior if we generated i_versions with > > > something like: > > > > > > i_version++; > > > if (i_version < s_version_max_old) > > > i_version = s_version_max_old; > > > if (i_version > s_version_max) > > > s_version_max = i_version + 1; > > > > > > But that last step makes this ludicrously expensive, because for this to > > > be safe across crashes we need to update that value on disk as well, and > > > we need to do that frequently. > > > > > > Fortunately, s_version_max doesn't have to be a tight bound at all. We > > > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at > > > a time. And recognize when we're running up against it way ahead of > > > time, so we only need to say "here's an updated value, could you please > > > make sure it gets to disk sometime in the next twenty minutes"? > > > (Numbers made up.) > > > > > > Sorry, that was way too many words. But I think something like that > > > could work, and make it very difficult to hit any hard limits, and > > > actually not be too complicated?? Unless I missed something. > > > > > > > That's not too many words -- I appreciate a good "for dummies" > > explanation! > > > > A scheme like that could work. It might be hard to do it without a > > spinlock or something, but maybe that's ok. Thinking more about how we'd > > implement this in the underlying filesystems: > > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > > the filesystem would need to bump s_version_max by the significant > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > > to do that. > > Why only increment on crash? If the filesystem has been unmounted, > then any cached data is -stale- and must be discarded. e.g. unmount, > run fsck which cleans up corrupt files but does not modify > i_version, then mount. Remote caches are now invalid, but i_version > may not have changed, so we still need the clean unmount-mount cycle > to invalidate caches. I disagree. We do need fsck to cause caches to be invalidated IF IT FOUND SOMETHING TO REPAIR, but not if the filesystem was truely clean. > > IOWs, what we want is a salted i_version value, with the filesystem > providing the unique per-mount salt that gets added to the > externally visible i_version values. I agree this is a simple approach. Possible the best. > > If that's the case, the salt doesn't need to be restricted to just > modifying the upper bits - as long as the salt increments > substantially and independently to the on-disk inode i_version then > we just don't care what bits of the superblock salt change from > mount to mount. 
> > For XFS we already have a unique 64 bit salt we could use for every > mount - clean or unclean - and guarantee it is larger for every > mount. It also gets substantially bumped by fsck, too. It's called a > Log Sequence Number and we use them to track and strictly order > every modification we write into the log. This is exactly what is > needed for a i_version salt, and it's already guaranteed to be > persistent. Invalidating the client cache on EVERY unmount/mount could impose unnecessary cost. Imagine a client that caches a lot of data (several large files) from a server which is expected to fail-over from one cluster node to another from time to time. Adding extra delays to a fail-over is not likely to be well received. I don't *know* this cost would be unacceptable, and I *would* like to leave it to the filesystem to decide how to manage its own i_version values. So maybe XFS can use the LSN for a salt. If people notice the extra cost, they can complain. Thanks, NeilBrown > > > Would there be a way to ensure that the new s_version_max value has made > > it to disk? > > Yes, but that's not really relevant to the definition of the salt: > we don't need to design the filesystem implementation of a > persistent per-mount salt value. All we need is to define the > behaviour of the salt (e.g. must always increase across a > umount/mount cycle) and then you can let the filesystem developers > worry about how to provide the required salt behaviour and it's > persistence. > > In the mean time, you can implement the salting and testing it by > using the system time to seed the superblock salt - that's good > enough for proof of concept, and as a fallback for filesystems that > cannot provide the required per-mount salt persistence.... > > > Bumping it by a large value and hoping for the best might be > > ok for most cases, but there are always outliers, so it might be > > worthwhile to make an i_version increment wait on that if necessary. > > Nothing should be able to query i_version until the filesystem is > fully recovered, mounted and the salt has been set. Hence no > application (kernel or userspace) should ever see an unsalted > i_version value.... > > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote: > On Tue, 13 Sep 2022, Dave Chinner wrote: > > On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote: > > > On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote: > > > > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: > > > > Our goal is to ensure that after a crash, any *new* i_versions that we > > > > give out or write to disk are larger than any that have previously been > > > > given out. We can do that by ensuring that they're equal to at least > > > > that old maximum. > > > > > > > > So think of the 64-bit value we're storing in the superblock as a > > > > ceiling on i_version values across all the filesystem's inodes. Call it > > > > s_version_max or something. We also need to know what the maximum was > > > > before the most recent crash. Call that s_version_max_old. > > > > > > > > Then we could get correct behavior if we generated i_versions with > > > > something like: > > > > > > > > i_version++; > > > > if (i_version < s_version_max_old) > > > > i_version = s_version_max_old; > > > > if (i_version > s_version_max) > > > > s_version_max = i_version + 1; > > > > > > > > But that last step makes this ludicrously expensive, because for this to > > > > be safe across crashes we need to update that value on disk as well, and > > > > we need to do that frequently. > > > > > > > > Fortunately, s_version_max doesn't have to be a tight bound at all. We > > > > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at > > > > a time. And recognize when we're running up against it way ahead of > > > > time, so we only need to say "here's an updated value, could you please > > > > make sure it gets to disk sometime in the next twenty minutes"? > > > > (Numbers made up.) > > > > > > > > Sorry, that was way too many words. But I think something like that > > > > could work, and make it very difficult to hit any hard limits, and > > > > actually not be too complicated?? Unless I missed something. > > > > > > > > > > That's not too many words -- I appreciate a good "for dummies" > > > explanation! > > > > > > A scheme like that could work. It might be hard to do it without a > > > spinlock or something, but maybe that's ok. Thinking more about how we'd > > > implement this in the underlying filesystems: > > > > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > > > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > > > the filesystem would need to bump s_version_max by the significant > > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > > > to do that. > > > > Why only increment on crash? If the filesystem has been unmounted, > > then any cached data is -stale- and must be discarded. e.g. unmount, > > run fsck which cleans up corrupt files but does not modify > > i_version, then mount. Remote caches are now invalid, but i_version > > may not have changed, so we still need the clean unmount-mount cycle > > to invalidate caches. > > I disagree. We do need fsck to cause caches to be invalidated IF IT > FOUND SOMETHING TO REPAIR, but not if the filesystem was truely clean. <sigh> Neil, why the fuck are you shouting at me for making the obvious observation that data in cleanly unmount filesystems can be modified when they are off line? 
Indeed, we know there are many systems out there that mount a filesystem, preallocate and map the blocks that are allocated to a large file, unmount the filesysetm, mmap the ranges of the block device and pass them to RDMA hardware, then have sensor arrays rdma data directly into the block device. Then when the measurement application is done they walk the ondisk metadata to remove the unwritten flags on the extents, mount the filesystem again and export the file data to a HPC cluster for post-processing..... So how does the filesystem know whether data the storage contains for it's files has been modified while it is unmounted and so needs to change the salt? The short answer is that it can't, and so we cannot make assumptions that a unmount/mount cycle has not changed the filesystem in any way.... > > IOWs, what we want is a salted i_version value, with the filesystem > > providing the unique per-mount salt that gets added to the > > externally visible i_version values. > > I agree this is a simple approach. Possible the best. > > > > > If that's the case, the salt doesn't need to be restricted to just > > modifying the upper bits - as long as the salt increments > > substantially and independently to the on-disk inode i_version then > > we just don't care what bits of the superblock salt change from > > mount to mount. > > > > For XFS we already have a unique 64 bit salt we could use for every > > mount - clean or unclean - and guarantee it is larger for every > > mount. It also gets substantially bumped by fsck, too. It's called a > > Log Sequence Number and we use them to track and strictly order > > every modification we write into the log. This is exactly what is > > needed for a i_version salt, and it's already guaranteed to be > > persistent. > > Invalidating the client cache on EVERY unmount/mount could impose > unnecessary cost. Imagine a client that caches a lot of data (several > large files) from a server which is expected to fail-over from one > cluster node to another from time to time. Adding extra delays to a > fail-over is not likely to be well received. HA fail-over is something that happens rarely, and isn't something we should be trying to optimise i_version for. Indeed, HA failover is usually a result of an active server crash/failure, in which case server side filesystem recovery is required before the new node can export the filesystem again. That's exactly the case you are talking about needing to have the salt change to invalidate potentially stale client side i_version values.... If the HA system needs to control the salt for co-ordinated, cache coherent hand-over then -add an option for the HA server to control the salt value itself-. HA orchestration has to handle so much state hand-over between server nodes already that handling a salt value for the mount is no big deal. This really is not something that individual local filesystems need to care about, ever. -Dave.
On Tue, 13 Sep 2022, Dave Chinner wrote: > > Indeed, we know there are many systems out there that mount a > filesystem, preallocate and map the blocks that are allocated to a > large file, unmount the filesysetm, mmap the ranges of the block > device and pass them to RDMA hardware, then have sensor arrays rdma > data directly into the block device. Then when the measurement > application is done they walk the ondisk metadata to remove the > unwritten flags on the extents, mount the filesystem again and > export the file data to a HPC cluster for post-processing..... And this tool doesn't update the i_version? Sounds like a bug. > > So how does the filesystem know whether data the storage contains > for it's files has been modified while it is unmounted and so needs > to change the salt? How does it know that no data is modified while it *is* mounted? Some assumptions have to be made. > > The short answer is that it can't, and so we cannot make assumptions > that a unmount/mount cycle has not changed the filesystem in any > way.... If a mount-count is the best that XFS can do, then that is certainly what it should use. Thanks, NeilBrown
On Tue, Sep 13, 2022 at 01:30:58PM +1000, NeilBrown wrote: > On Tue, 13 Sep 2022, Dave Chinner wrote: > > > > Indeed, we know there are many systems out there that mount a > > filesystem, preallocate and map the blocks that are allocated to a > > large file, unmount the filesysetm, mmap the ranges of the block > > device and pass them to RDMA hardware, then have sensor arrays rdma > > data directly into the block device..... > > And this tool doesn't update the i_version? Sounds like a bug. Tools that do this include "grub" and "lilo". Fortunately, most people aren't trying to export their /boot directory over NFS. :-P That being said, all we can strive for is "good enough" and not "perfection". So if I were to add a "crash counter" to the ext4 superblock, I can make sure it gets incremented (a) whenever the journal is replayed (assuming that we decide to use lazytime-style update for i_version for performance reasons), or (b) when fsck needs to fix some file system inconsistency, or (c) when some external tool like debugfs or fuse2fs is modifying the file system. Will this get *everything*? No. For example, in addition to Linux boot loaders, there might be userspace which uses FIEMAP to get the physical block #'s for a file, and then reads and writes to those blocks using a kernel-bypass interface for high-speed SSDs. I happen to know of thousands of machines that are doing this with ext4 in production today, so this isn't a hypothetical example; fortunately, they aren't exporting their file system over NFS, nor are they likely to do so. :-) - Ted
On Tue, 2022-09-13 at 11:15 +1000, Dave Chinner wrote: > On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote: > > On Mon, 12 Sep 2022, Jeff Layton wrote: > > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote: > > > > This could be expensive. > > > > > > > > There is not currently any locking around O_DIRECT writes. You cannot > > > > synchronise with them. > > > > > > > > > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold > > > inode_lock_shared across the I/O. That was why patch #8 takes the > > > inode_lock (exclusive) across the getattr. > > > > Looking at ext4_dio_write_iter() it certain does take > > inode_lock_shared() before starting the write and in some cases it > > requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for > > the write to complete. But not in all cases. > > So I don't think it always holds the shared lock across all direct IO. > > To serialise against dio writes, one must: > > // Lock the inode exclusively to block new DIO submissions > inode_lock(inode); > > // Wait for all in flight DIO reads and writes to complete > inode_dio_wait(inode); > > This is how truncate, fallocate, etc serialise against AIO+DIO which > do not hold the inode lock across the entire IO. These have to > serialise aginst DIO reads, too, because we can't have IO in > progress over a range of the file that we are about to free.... > Thanks, that clarifies a bit. > > > Taking inode_lock_shared is sufficient to block out buffered and DAX > > > writes. DIO writes sometimes only take the shared lock (e.g. when the > > > data is already properly aligned). If we want to ensure the getattr > > > doesn't run while _any_ writes are running, we'd need the exclusive > > > lock. > > > > But the exclusive lock is bad for scalability. > > Serilisation against DIO is -expensive- and -slow-. It's not a > solution for what is supposed to be a fast unlocked read-only > operation like statx(). > Fair enough. I labeled that patch with RFC as I suspected that it would be too expensive. I don't think we can guarantee perfect consistency vs. mmap either, so carving out DIO is not a stretch (at least not for NFSv4). > > > Maybe that's overkill, though it seems like we could have a race like > > > this without taking inode_lock across the getattr: > > > > > > reader writer > > > ----------------------------------------------------------------- > > > i_version++ > > > getattr > > > read > > > DIO write to backing store > > > > > > > This is why I keep saying that the i_version increment must be after the > > write, not before it. > > Sure, but that ignores the reason why we actually need to bump > i_version *before* we submit a DIO write. DIO write invalidates the > page cache over the range of the write, so any racing read will > re-populate the page cache during the DIO write. > > Hence buffered reads can return before the DIO write has completed, > and the contents of the read can contain, none, some or all of the > contents of the DIO write. Hence i_version has to be incremented > before the DIO write is submitted so that racing getattrs will > indicate that the local caches have been invalidated and that data > needs to be refetched. > Bumping the change attribute after the write is done would be sufficient for serving NFSv4. The clients just invalidate their caches if they see the value change. Bumping it before and after would be fine too. We might get some spurious cache invalidations but they'd be infrequent. 
FWIW, we've never guaranteed any real atomicity with NFS readers vs. writers. Clients may see the intermediate stages of a write from a different client if their reads race in at the right time. If you need real atomicity, then you really should be using locking. What we _do_ try to ensure is timely pagecache invalidation when this occurs. If we want to expose this to userland via statx in the future, then we may need a stronger guarantee because we can't as easily predict how people will want to use this. At that point, bumping i_version both before and after makes a bit more sense, since it better ensures that a change will be noticed, whether the related read op comes before or after the statx. > But, yes, to actually be safe here, we *also* should be bumping > i_version on DIO write on DIO write completion so that racing > i_version and data reads that occur *after* the initial i_version > bump are invalidated immediately. > > IOWs, to avoid getattr/read races missing stale data invalidations > during DIO writes, we really need to bump i_version both _before and > after_ DIO write submission. > > It's corner cases like this where "i_version should only be bumped > when ctime changes" fails completely. i.e. there are concurrent IO > situations which can only really be handled correctly by bumping > i_version whenever either in-memory and/or on-disk persistent data/ > metadata state changes occur..... I think we have two choices (so far) when it comes to closing the race window between the i_version bump and the write. Either should be fine for serving NFSv4. 1/ take the inode_lock in some form across the getattr call for filling out GETATTR/READDIR/NVERIFY info. This is what the RFC patch in my latest set does. That's obviously too expensive though. We could take inode_lock_shared, which wouldn't exclude DIO, but would cover the buffered and DAX codepaths. This is somewhat ugly though, particularly with slow backend network filesystems (like NFS). That getattr could take a while, and meanwhile all writes are stuck... ...or... 2/ start bumping the i_version after a write completes. Bumping it twice (before and after) would be fine too. In most cases the second one will be a no-op anyway. We might get the occasional false cache invalidations there with NFS, but they should be pretty rare and that's preferable to holding on to invalid cached data (which I think is a danger today). To do #2, I guess we'd need to add an inode_maybe_inc_iversion call at the end of the relevant ->write_iter ops, and then dirty the inode if that comes back true? That should be pretty rare. We do also still need some way to mitigate potential repeated versions due to crashes, but that's orthogonal to the above issue (and being discussed in a different branch of this thread).
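A sketch of what option 2 might look like at the tail of a write path -- purely illustrative, since no in-tree ->write_iter does this today and filesystem-specific details (journalling, ordering against timestamp updates) are ignored:

    #include <linux/fs.h>
    #include <linux/iversion.h>

    static ssize_t example_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
            struct inode *inode = file_inode(iocb->ki_filp);
            ssize_t ret;

            ret = generic_file_write_iter(iocb, from);	/* the write itself */

            /* bump the change attribute once the data is actually visible */
            if (ret > 0 && inode_maybe_inc_iversion(inode, false))
                    mark_inode_dirty(inode);
            return ret;
    }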
On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote: > Invalidating the client cache on EVERY unmount/mount could impose > unnecessary cost. Imagine a client that caches a lot of data (several > large files) from a server which is expected to fail-over from one > cluster node to another from time to time. Adding extra delays to a > fail-over is not likely to be well received. > > I don't *know* this cost would be unacceptable, and I *would* like to > leave it to the filesystem to decide how to manage its own i_version > values. So maybe XFS can use the LSN for a salt. If people notice the > extra cost, they can complain. I'd expect complaints. NFS is actually even worse than this: it allows clients to reacquire file locks across server restart and unmount/remount, even though obviously the kernel will do nothing to prevent someone else from locking (or modifying) the file in between. Administrators are just supposed to know not to allow other applications access to the filesystem until nfsd's started. It's always been this way. You can imagine all sorts of measures to prevent that, and if anyone wants to work on ways to prevent people from shooting themselves in the foot here, great. Just taking away the ability to cache or lock across reboots wouldn't make people happy, though.... --b.
On Wed, 14 Sep 2022, J. Bruce Fields wrote: > On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote: > > Invalidating the client cache on EVERY unmount/mount could impose > > unnecessary cost. Imagine a client that caches a lot of data (several > > large files) from a server which is expected to fail-over from one > > cluster node to another from time to time. Adding extra delays to a > > fail-over is not likely to be well received. > > > > I don't *know* this cost would be unacceptable, and I *would* like to > > leave it to the filesystem to decide how to manage its own i_version > > values. So maybe XFS can use the LSN for a salt. If people notice the > > extra cost, they can complain. > > I'd expect complaints. > > NFS is actually even worse than this: it allows clients to reacquire > file locks across server restart and unmount/remount, even though > obviously the kernel will do nothing to prevent someone else from > locking (or modifying) the file in between. I don't understand this comment. You seem to be implying that changing the i_version during a server restart would stop a client from reclaiming locks. Is that correct? I would have thought that the client would largely ignore i_version while it has a lock or open or delegation, as these tend to imply some degree of exclusive access ("open" being least exclusive). Thanks, NeilBrown > > Administrators are just supposed to know not to allow other applications > access to the filesystem until nfsd's started. It's always been this > way. > > You can imagine all sorts of measures to prevent that, and if anyone > wants to work on ways to prevent people from shooting themselves in the > foot here, great. > > Just taking away the ability to cache or lock across reboots wouldn't > make people happy, though.... > > --b. >
On Wed, 14 Sep 2022, Jeff Layton wrote: > > At that point, bumping i_version both before and after makes a bit more > sense, since it better ensures that a change will be noticed, whether > the related read op comes before or after the statx. How does bumping it before make any sense at all? Maybe it wouldn't hurt much, but how does it help anyone at all? i_version must appear to change no sooner than the change it reflects becomes visible and no later than the request which initiated that change is acknowledged as complete. Why would that definition ever not be satisfactory? NeilBrown
On Wed, Sep 14, 2022 at 09:19:22AM +1000, NeilBrown wrote: > On Wed, 14 Sep 2022, J. Bruce Fields wrote: > > On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote: > > > Invalidating the client cache on EVERY unmount/mount could impose > > > unnecessary cost. Imagine a client that caches a lot of data (several > > > large files) from a server which is expected to fail-over from one > > > cluster node to another from time to time. Adding extra delays to a > > > fail-over is not likely to be well received. > > > > > > I don't *know* this cost would be unacceptable, and I *would* like to > > > leave it to the filesystem to decide how to manage its own i_version > > > values. So maybe XFS can use the LSN for a salt. If people notice the > > > extra cost, they can complain. > > > > I'd expect complaints. > > > > NFS is actually even worse than this: it allows clients to reacquire > > file locks across server restart and unmount/remount, even though > > obviously the kernel will do nothing to prevent someone else from > > locking (or modifying) the file in between. > > I don't understand this comment. You seem to be implying that changing > the i_version during a server restart would stop a client from > reclaiming locks. Is that correct? No, sorry, I'm probably being confusing. I was just saying: we've always depended in a lot of ways on the assumption that filesystems aren't messed with while nfsd's not running. You can produce all sorts of incorrect behavior by violating that assumption. That tools might fool with unmounted filesystems is just another such example, and fixing that wouldn't be very high on my list of priorities. ?? --b. > I would have thought that the client would largely ignore i_version > while it has a lock or open or delegation, as these tend to imply some > degree of exclusive access ("open" being least exclusive). > > Thanks, > NeilBrown > > > > > > Administrators are just supposed to know not to allow other applications > > access to the filesystem until nfsd's started. It's always been this > > way. > > > > You can imagine all sorts of measures to prevent that, and if anyone > > wants to work on ways to prevent people from shooting themselves in the > > foot here, great. > > > > Just taking away the ability to cache or lock across reboots wouldn't > > make people happy, though.... > > > > --b. > >
On Wed, 2022-09-14 at 09:24 +1000, NeilBrown wrote: > On Wed, 14 Sep 2022, Jeff Layton wrote: > > > > At that point, bumping i_version both before and after makes a bit more > > sense, since it better ensures that a change will be noticed, whether > > the related read op comes before or after the statx. > > How does bumping it before make any sense at all? Maybe it wouldn't > hurt much, but how does it help anyone at all? > My assumption (maybe wrong) was that timestamp updates were done before the actual write by design. Does doing it before the write increase the chances that the inode metadata writeout will get done in the same physical I/O as the data write? IDK, just speculating here. If there's no benefit to doing it before then we should just move it afterward. > i_version must appear to change no sooner than the change it reflects > becomes visible and no later than the request which initiated that > change is acknowledged as complete. > > Why would that definition ever not be satisfactory? It's fine with me.
On Wed, 14 Sep 2022, Jeff Layton wrote: > On Wed, 2022-09-14 at 09:24 +1000, NeilBrown wrote: > > On Wed, 14 Sep 2022, Jeff Layton wrote: > > > > > > At that point, bumping i_version both before and after makes a bit more > > > sense, since it better ensures that a change will be noticed, whether > > > the related read op comes before or after the statx. > > > > How does bumping it before make any sense at all? Maybe it wouldn't > > hurt much, but how does it help anyone at all? > > > > My assumption (maybe wrong) was that timestamp updates were done before > the actual write by design. Does doing it before the write increase > the chances that the inode metadata writeout will get done in the same > physical I/O as the data write? IDK, just speculating here. When the code was written, the inode semaphore (before mutexes) was held over the whole thing, and timestamp resolution was 1 second. So ordering didn't really matter. Since then locking has been reduced and precision increased but no-one saw any need to fix the ordering. I think that is fine for timestamps. But i_version is about absolute precision, so we need to think carefully about what meets our needs. > > If there's no benefit to doing it before then we should just move it > afterward. Great! Thanks, NeilBrown
On Thu, 15 Sep 2022, NeilBrown wrote: > > When the code was written, the inode semaphore (before mutexes) was held > over the whole thing, and timestamp resolution was 1 second. So > ordering didn't really matter. Since then locking has been reduced and > precision increased but no-one saw any need to fix the ordering. I > think that is fine for timestamps. Actually it is much more complex than that, though the principle is still the same: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29 shows i_mtime updates being moved from *after* a call to generic_file_write() in each filesystem to *early* in the body of generic_file_write(). Probably because that was just a convenient place to put it. NeilBrown
On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > > > (I missed this bit in my earlier reply..) > > > > > > How is it "Fine" to see an old version? > > > The file could have changed without the version changing. > > > And I thought one of the goals of the crash-count was to be able to > > > provide a monotonic change id. > > > > I was still mainly thinking about how to provide reliable close-to-open > > semantics between NFS clients. In the case the writer was an NFS > > client, it wasn't done writing (or it would have COMMITted), so those > > writes will come in and bump the change attribute soon, and as long as > > we avoid the small chance of reusing an old change attribute, we're OK, > > and I think it'd even still be OK to advertise > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > You seem to be assuming that the client doesn't crash at the same time > as the server (maybe they are both VMs on a host that lost power...) > > If client A reads and caches, client B writes, the server crashes after > writing some data (to already allocated space so no inode update needed) > but before writing the new i_version, then client B crashes. > When server comes back the i_version will be unchanged but the data has > changed. Client A will cache old data indefinitely... I guess I assume that if all we're promising is close-to-open, then a client isn't allowed to trust its cache in that situation. Maybe that's an overly draconian interpretation of close-to-open. Also, I'm trying to think about how to improve things incrementally. Incorporating something like a crash count into the on-disk i_version fixes some cases without introducing any new ones or regressing performance after a crash. If we subsequently wanted to close those remaining holes, I think we'd need the change attribute increment to be seen as atomic with respect to its associated change, both to clients and (separately) on disk. (That would still allow the change attribute to go backwards after a crash, to the value it held as of the on-disk state of the file. I think clients should be able to deal with that case.) But, I don't know, maybe a bigger hammer would be OK: > I think we need to require the filesystem to ensure that the i_version > is seen to increase shortly after any change becomes visible in the > file, and no later than the moment when the request that initiated the > change is acknowledged as being complete. In the case of an unclean > restart, any file that is not known to have been unchanged immediately > before the crash must have i_version increased. > > The simplest implementation is to have an unclean-restart counter and to > always included this multiplied by some constant X in the reported > i_version. The filesystem guarantees to record (e.g. to journal > at least) the i_version if it comes close to X more than the previous > record. The filesystem gets to choose X. So the question is whether people can live with invalidating all client caches after a cache. I don't know. 
> A more complex solution would be to record (similar to the way orphans > are recorded) any file which is open for write, and to add X to the > i_version for any "dirty" file still recorded during an unclean > restart. This would avoid bumping the i_version for read-only files. Is that practical? Working out the performance tradeoffs sounds like a project. > There may be other solutions, but we should leave that up to the > filesystem. Each filesystem might choose something different. Sure. --b.
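To make the quoted "constant X" scheme concrete, here is one possible shape for it. Everything below is hypothetical pseudocode for a filesystem-private policy rather than an existing kernel interface: the structures, field names, and the journalling helper are invented, and X is whatever slack the filesystem is prepared to lose across an unclean restart.

#define EXAMPLE_IVERSION_SLACK	(1ULL << 16)	/* "X", chosen by the filesystem */

struct example_sb_info {
	u64 unclean_restarts;	/* bumped when the fs is mounted after an unclean shutdown */
};

struct example_inode_info {
	struct example_sb_info *sbi;
	u64 i_version;		/* in-memory counter */
	u64 last_recorded;	/* last value known to be safe in the journal */
};

static void example_journal_iversion(struct example_inode_info *ei);	/* hypothetical */

/* Value reported via statx / NFS GETATTR. */
static u64 example_report_iversion(struct example_inode_info *ei)
{
	return ei->i_version + ei->sbi->unclean_restarts * EXAMPLE_IVERSION_SLACK;
}

/* Called whenever the inode's data or metadata changes. */
static void example_bump_iversion(struct example_inode_info *ei)
{
	ei->i_version++;
	/* journal the counter before it drifts anywhere near X ahead of disk */
	if (ei->i_version - ei->last_recorded >= EXAMPLE_IVERSION_SLACK / 2)
		example_journal_iversion(ei);	/* also updates last_recorded */
}

The point of the X/2 margin is only that the on-disk record can never lag the in-memory counter by X or more, so adding unclean_restarts * X after a crash is guaranteed to move every reported value forward.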
On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > The machine crashes and comes back up, and we get a query for > > > > > i_version > > > > > and it comes back as X. Fine, it's an old version. Now there > > > > > is a write. > > > > > What do we do to ensure that the new value doesn't collide > > > > > with X+1? > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > How is it "Fine" to see an old version? > > > > The file could have changed without the version changing. > > > > And I thought one of the goals of the crash-count was to be > > > > able to > > > > provide a monotonic change id. > > > > > > I was still mainly thinking about how to provide reliable close- > > > to-open > > > semantics between NFS clients. In the case the writer was an NFS > > > client, it wasn't done writing (or it would have COMMITted), so > > > those > > > writes will come in and bump the change attribute soon, and as > > > long as > > > we avoid the small chance of reusing an old change attribute, > > > we're OK, > > > and I think it'd even still be OK to advertise > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > You seem to be assuming that the client doesn't crash at the same > > time > > as the server (maybe they are both VMs on a host that lost > > power...) > > > > If client A reads and caches, client B writes, the server crashes > > after > > writing some data (to already allocated space so no inode update > > needed) > > but before writing the new i_version, then client B crashes. > > When server comes back the i_version will be unchanged but the data > > has > > changed. Client A will cache old data indefinitely... > > I guess I assume that if all we're promising is close-to-open, then a > client isn't allowed to trust its cache in that situation. Maybe > that's > an overly draconian interpretation of close-to-open. > > Also, I'm trying to think about how to improve things incrementally. > Incorporating something like a crash count into the on-disk i_version > fixes some cases without introducing any new ones or regressing > performance after a crash. > > If we subsequently wanted to close those remaining holes, I think > we'd > need the change attribute increment to be seen as atomic with respect > to > its associated change, both to clients and (separately) on disk. > (That > would still allow the change attribute to go backwards after a crash, > to > the value it held as of the on-disk state of the file. I think > clients > should be able to deal with that case.) > > But, I don't know, maybe a bigger hammer would be OK: > If you're not going to meet the minimum bar of data integrity, then this whole exercise is just a massive waste of everyone's time. The answer then going forward is just to recommend never using Linux as an NFS server. Makes my life much easier, because I no longer have to debug any of the issues. >
On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > How is it "Fine" to see an old version? > > > > The file could have changed without the version changing. > > > > And I thought one of the goals of the crash-count was to be able to > > > > provide a monotonic change id. > > > > > > I was still mainly thinking about how to provide reliable close-to-open > > > semantics between NFS clients. In the case the writer was an NFS > > > client, it wasn't done writing (or it would have COMMITted), so those > > > writes will come in and bump the change attribute soon, and as long as > > > we avoid the small chance of reusing an old change attribute, we're OK, > > > and I think it'd even still be OK to advertise > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > You seem to be assuming that the client doesn't crash at the same time > > as the server (maybe they are both VMs on a host that lost power...) > > > > If client A reads and caches, client B writes, the server crashes after > > writing some data (to already allocated space so no inode update needed) > > but before writing the new i_version, then client B crashes. > > When server comes back the i_version will be unchanged but the data has > > changed. Client A will cache old data indefinitely... > > I guess I assume that if all we're promising is close-to-open, then a > client isn't allowed to trust its cache in that situation. Maybe that's > an overly draconian interpretation of close-to-open. > > Also, I'm trying to think about how to improve things incrementally. > Incorporating something like a crash count into the on-disk i_version > fixes some cases without introducing any new ones or regressing > performance after a crash. > I think we ought to start there. > If we subsequently wanted to close those remaining holes, I think we'd > need the change attribute increment to be seen as atomic with respect to > its associated change, both to clients and (separately) on disk. (That > would still allow the change attribute to go backwards after a crash, to > the value it held as of the on-disk state of the file. I think clients > should be able to deal with that case.) > > But, I don't know, maybe a bigger hammer would be OK: > > > I think we need to require the filesystem to ensure that the i_version > > is seen to increase shortly after any change becomes visible in the > > file, and no later than the moment when the request that initiated the > > change is acknowledged as being complete. In the case of an unclean > > restart, any file that is not known to have been unchanged immediately > > before the crash must have i_version increased. > > > > The simplest implementation is to have an unclean-restart counter and to > > always included this multiplied by some constant X in the reported > > i_version. The filesystem guarantees to record (e.g. to journal > > at least) the i_version if it comes close to X more than the previous > > record. The filesystem gets to choose X. 
> > So the question is whether people can live with invalidating all client > caches after a cache. I don't know. > I assume you mean "after a crash". Yeah, that is pretty nasty. We don't get perfect crash resilience with incorporating this into the on-disk value, but I like that better than factoring it in at presentation time. That would mean that the servers would end up getting hammered with read activity after a crash (at least in some environments). I don't think that would be worth the tradeoff. There's a real benefit to preserving caches when we can. > > A more complex solution would be to record (similar to the way orphans > > are recorded) any file which is open for write, and to add X to the > > i_version for any "dirty" file still recorded during an unclean > > restart. This would avoid bumping the i_version for read-only files. > > Is that practical? Working out the performance tradeoffs sounds like a > project. > > > > There may be other solutions, but we should leave that up to the > > filesystem. Each filesystem might choose something different. > > Sure. > Agreed here too. I think we need to allow for some flexibility here. Here's what I'm thinking: We'll carve out the upper 16 bits in the i_version counter to be the crash counter field. That gives us 64k crashes before we have to worry about collisions. Hopefully the remaining 47 bits of counter will be plenty given that we don't increment it when it's not being queried or nothing else changes. (Can we mitigate wrapping here somehow?) The easiest way to do this would be to add a u16 s_crash_counter to struct super_block. We'd initialize that to 0, and the filesystem could fill that value out at mount time. Then inode_maybe_inc_iversion can just shift the s_crash_counter left by 48 bits and plop it into the top of the value we're preparing to cmpxchg into place. This is backward compatible too, at least for i_version counter values that are <2^47. With anything larger, we might end up with something going backward and a possible collision, but it's (hopefully) a small risk.
On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote: > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > The machine crashes and comes back up, and we get a query for > > > > > > i_version > > > > > > and it comes back as X. Fine, it's an old version. Now there > > > > > > is a write. > > > > > > What do we do to ensure that the new value doesn't collide > > > > > > with X+1? > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > How is it "Fine" to see an old version? > > > > > The file could have changed without the version changing. > > > > > And I thought one of the goals of the crash-count was to be > > > > > able to > > > > > provide a monotonic change id. > > > > > > > > I was still mainly thinking about how to provide reliable close- > > > > to-open > > > > semantics between NFS clients. In the case the writer was an NFS > > > > client, it wasn't done writing (or it would have COMMITted), so > > > > those > > > > writes will come in and bump the change attribute soon, and as > > > > long as > > > > we avoid the small chance of reusing an old change attribute, > > > > we're OK, > > > > and I think it'd even still be OK to advertise > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > You seem to be assuming that the client doesn't crash at the same > > > time > > > as the server (maybe they are both VMs on a host that lost > > > power...) > > > > > > If client A reads and caches, client B writes, the server crashes > > > after > > > writing some data (to already allocated space so no inode update > > > needed) > > > but before writing the new i_version, then client B crashes. > > > When server comes back the i_version will be unchanged but the data > > > has > > > changed. Client A will cache old data indefinitely... > > > > I guess I assume that if all we're promising is close-to-open, then a > > client isn't allowed to trust its cache in that situation. Maybe > > that's > > an overly draconian interpretation of close-to-open. > > > > Also, I'm trying to think about how to improve things incrementally. > > Incorporating something like a crash count into the on-disk i_version > > fixes some cases without introducing any new ones or regressing > > performance after a crash. > > > > If we subsequently wanted to close those remaining holes, I think > > we'd > > need the change attribute increment to be seen as atomic with respect > > to > > its associated change, both to clients and (separately) on disk. > > (That > > would still allow the change attribute to go backwards after a crash, > > to > > the value it held as of the on-disk state of the file. I think > > clients > > should be able to deal with that case.) > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > If you're not going to meet the minimum bar of data integrity, then > this whole exercise is just a massive waste of everyone's time. The > answer then going forward is just to recommend never using Linux as an > NFS server. Makes my life much easier, because I no longer have to > debug any of the issues. > > To be clear, you believe any scheme that would allow the client to see an old change attr after a crash is insufficient? 
The only way I can see to fix that (at least with only a crash counter) would be to factor it in at presentation time like Neil suggested. Basically we'd just mask off the top 16 bits and plop the crash counter in there before presenting it. In principle, I suppose we could do that at the nfsd level as well (and that might be the simplest way to fix this). We probably wouldn't be able to advertise a change attr type of MONOTONIC with this scheme though.
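For illustration, the presentation-time masking described above could be as simple as the following; the function is hypothetical (it is not an existing nfsd or VFS helper), and the 16/48 split mirrors the earlier proposal:

/*
 * Hypothetical helper: combine the filesystem's change counter with a
 * server-side crash/epoch counter at presentation time. The low 48 bits
 * come from the filesystem; the crash counter overwrites the top 16 bits.
 */
static u64 present_change_attr(u64 fs_iversion, u16 crash_counter)
{
	return (fs_iversion & ((1ULL << 48) - 1)) | ((u64)crash_counter << 48);
}

One trade-off of doing this at the nfsd level is that the value NFS clients see would no longer match what a local statx reports.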
On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote: > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote: > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > > > The machine crashes and comes back up, and we get a query > > > > > > > for > > > > > > > i_version > > > > > > > and it comes back as X. Fine, it's an old version. Now > > > > > > > there > > > > > > > is a write. > > > > > > > What do we do to ensure that the new value doesn't > > > > > > > collide > > > > > > > with X+1? > > > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > > > How is it "Fine" to see an old version? > > > > > > The file could have changed without the version changing. > > > > > > And I thought one of the goals of the crash-count was to be > > > > > > able to > > > > > > provide a monotonic change id. > > > > > > > > > > I was still mainly thinking about how to provide reliable > > > > > close- > > > > > to-open > > > > > semantics between NFS clients. In the case the writer was an > > > > > NFS > > > > > client, it wasn't done writing (or it would have COMMITted), > > > > > so > > > > > those > > > > > writes will come in and bump the change attribute soon, and > > > > > as > > > > > long as > > > > > we avoid the small chance of reusing an old change attribute, > > > > > we're OK, > > > > > and I think it'd even still be OK to advertise > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > > > You seem to be assuming that the client doesn't crash at the > > > > same > > > > time > > > > as the server (maybe they are both VMs on a host that lost > > > > power...) > > > > > > > > If client A reads and caches, client B writes, the server > > > > crashes > > > > after > > > > writing some data (to already allocated space so no inode > > > > update > > > > needed) > > > > but before writing the new i_version, then client B crashes. > > > > When server comes back the i_version will be unchanged but the > > > > data > > > > has > > > > changed. Client A will cache old data indefinitely... > > > > > > I guess I assume that if all we're promising is close-to-open, > > > then a > > > client isn't allowed to trust its cache in that situation. Maybe > > > that's > > > an overly draconian interpretation of close-to-open. > > > > > > Also, I'm trying to think about how to improve things > > > incrementally. > > > Incorporating something like a crash count into the on-disk > > > i_version > > > fixes some cases without introducing any new ones or regressing > > > performance after a crash. > > > > > > If we subsequently wanted to close those remaining holes, I think > > > we'd > > > need the change attribute increment to be seen as atomic with > > > respect > > > to > > > its associated change, both to clients and (separately) on disk. > > > (That > > > would still allow the change attribute to go backwards after a > > > crash, > > > to > > > the value it held as of the on-disk state of the file. I think > > > clients > > > should be able to deal with that case.) > > > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > > > If you're not going to meet the minimum bar of data integrity, then > > this whole exercise is just a massive waste of everyone's time. 
The > > answer then going forward is just to recommend never using Linux as > > an > > NFS server. Makes my life much easier, because I no longer have to > > debug any of the issues. > > > > > > To be clear, you believe any scheme that would allow the client to > see > an old change attr after a crash is insufficient? > Correct. If a NFSv4 client or userspace application cannot trust that it will always see a change to the change attribute value when the file data changes, then you will eventually see data corruption due to the cached data no longer matching the stored data. A false positive update of the change attribute (i.e. a case where the change attribute changes despite the data/metadata staying the same) is not desirable because it causes performance issues, but false negatives are far worse because they mean your data backup, cache, etc... are not consistent. Applications that have strong consistency requirements will have no option but to revalidate by always reading the entire file data + metadata. > The only way I can see to fix that (at least with only a crash > counter) > would be to factor it in at presentation time like Neil suggested. > Basically we'd just mask off the top 16 bits and plop the crash > counter > in there before presenting it. > > In principle, I suppose we could do that at the nfsd level as well > (and > that might be the simplest way to fix this). We probably wouldn't be > able to advertise a change attr type of MONOTONIC with this scheme > though. Why would you want to limit the crash counter to 16 bits?
On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote: > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote: > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote: > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > > > > > The machine crashes and comes back up, and we get a query > > > > > > > > for > > > > > > > > i_version > > > > > > > > and it comes back as X. Fine, it's an old version. Now > > > > > > > > there > > > > > > > > is a write. > > > > > > > > What do we do to ensure that the new value doesn't > > > > > > > > collide > > > > > > > > with X+1? > > > > > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > > > > > How is it "Fine" to see an old version? > > > > > > > The file could have changed without the version changing. > > > > > > > And I thought one of the goals of the crash-count was to be > > > > > > > able to > > > > > > > provide a monotonic change id. > > > > > > > > > > > > I was still mainly thinking about how to provide reliable > > > > > > close- > > > > > > to-open > > > > > > semantics between NFS clients. In the case the writer was an > > > > > > NFS > > > > > > client, it wasn't done writing (or it would have COMMITted), > > > > > > so > > > > > > those > > > > > > writes will come in and bump the change attribute soon, and > > > > > > as > > > > > > long as > > > > > > we avoid the small chance of reusing an old change attribute, > > > > > > we're OK, > > > > > > and I think it'd even still be OK to advertise > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > > > > > You seem to be assuming that the client doesn't crash at the > > > > > same > > > > > time > > > > > as the server (maybe they are both VMs on a host that lost > > > > > power...) > > > > > > > > > > If client A reads and caches, client B writes, the server > > > > > crashes > > > > > after > > > > > writing some data (to already allocated space so no inode > > > > > update > > > > > needed) > > > > > but before writing the new i_version, then client B crashes. > > > > > When server comes back the i_version will be unchanged but the > > > > > data > > > > > has > > > > > changed. Client A will cache old data indefinitely... > > > > > > > > I guess I assume that if all we're promising is close-to-open, > > > > then a > > > > client isn't allowed to trust its cache in that situation. Maybe > > > > that's > > > > an overly draconian interpretation of close-to-open. > > > > > > > > Also, I'm trying to think about how to improve things > > > > incrementally. > > > > Incorporating something like a crash count into the on-disk > > > > i_version > > > > fixes some cases without introducing any new ones or regressing > > > > performance after a crash. > > > > > > > > If we subsequently wanted to close those remaining holes, I think > > > > we'd > > > > need the change attribute increment to be seen as atomic with > > > > respect > > > > to > > > > its associated change, both to clients and (separately) on disk. > > > > (That > > > > would still allow the change attribute to go backwards after a > > > > crash, > > > > to > > > > the value it held as of the on-disk state of the file. I think > > > > clients > > > > should be able to deal with that case.) 
> > > > > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > > > > > > If you're not going to meet the minimum bar of data integrity, then > > > this whole exercise is just a massive waste of everyone's time. The > > > answer then going forward is just to recommend never using Linux as > > > an > > > NFS server. Makes my life much easier, because I no longer have to > > > debug any of the issues. > > > > > > > > > > To be clear, you believe any scheme that would allow the client to > > see > > an old change attr after a crash is insufficient? > > > > Correct. If a NFSv4 client or userspace application cannot trust that > it will always see a change to the change attribute value when the file > data changes, then you will eventually see data corruption due to the > cached data no longer matching the stored data. > > A false positive update of the change attribute (i.e. a case where the > change attribute changes despite the data/metadata staying the same) is > not desirable because it causes performance issues, but false negatives > are far worse because they mean your data backup, cache, etc... are not > consistent. Applications that have strong consistency requirements will > have no option but to revalidate by always reading the entire file data > + metadata. > > > The only way I can see to fix that (at least with only a crash > > counter) > > would be to factor it in at presentation time like Neil suggested. > > Basically we'd just mask off the top 16 bits and plop the crash > > counter > > in there before presenting it. > > > > In principle, I suppose we could do that at the nfsd level as well > > (and > > that might be the simplest way to fix this). We probably wouldn't be > > able to advertise a change attr type of MONOTONIC with this scheme > > though. > > Why would you want to limit the crash counter to 16 bits? > To leave more room for the "real" counter. Otherwise, an inode that gets frequent writes after a long period of no crashes could experience the counter wrap. IOW, we have 63 bits to play with. Whatever part we dedicate to the crash counter will not be available for the actual version counter. I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a different one.
On Thu, 2022-09-15 at 14:11 -0400, Jeff Layton wrote: > On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote: > > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote: > > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote: > > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown > > > > > > > wrote: > > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > > > > > > > The machine crashes and comes back up, and we get a > > > > > > > > > query > > > > > > > > > for > > > > > > > > > i_version > > > > > > > > > and it comes back as X. Fine, it's an old version. > > > > > > > > > Now > > > > > > > > > there > > > > > > > > > is a write. > > > > > > > > > What do we do to ensure that the new value doesn't > > > > > > > > > collide > > > > > > > > > with X+1? > > > > > > > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > > > > > > > How is it "Fine" to see an old version? > > > > > > > > The file could have changed without the version > > > > > > > > changing. > > > > > > > > And I thought one of the goals of the crash-count was > > > > > > > > to be > > > > > > > > able to > > > > > > > > provide a monotonic change id. > > > > > > > > > > > > > > I was still mainly thinking about how to provide reliable > > > > > > > close- > > > > > > > to-open > > > > > > > semantics between NFS clients. In the case the writer > > > > > > > was an > > > > > > > NFS > > > > > > > client, it wasn't done writing (or it would have > > > > > > > COMMITted), > > > > > > > so > > > > > > > those > > > > > > > writes will come in and bump the change attribute soon, > > > > > > > and > > > > > > > as > > > > > > > long as > > > > > > > we avoid the small chance of reusing an old change > > > > > > > attribute, > > > > > > > we're OK, > > > > > > > and I think it'd even still be OK to advertise > > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > > > > > > > You seem to be assuming that the client doesn't crash at > > > > > > the > > > > > > same > > > > > > time > > > > > > as the server (maybe they are both VMs on a host that lost > > > > > > power...) > > > > > > > > > > > > If client A reads and caches, client B writes, the server > > > > > > crashes > > > > > > after > > > > > > writing some data (to already allocated space so no inode > > > > > > update > > > > > > needed) > > > > > > but before writing the new i_version, then client B > > > > > > crashes. > > > > > > When server comes back the i_version will be unchanged but > > > > > > the > > > > > > data > > > > > > has > > > > > > changed. Client A will cache old data indefinitely... > > > > > > > > > > I guess I assume that if all we're promising is close-to- > > > > > open, > > > > > then a > > > > > client isn't allowed to trust its cache in that situation. > > > > > Maybe > > > > > that's > > > > > an overly draconian interpretation of close-to-open. > > > > > > > > > > Also, I'm trying to think about how to improve things > > > > > incrementally. > > > > > Incorporating something like a crash count into the on-disk > > > > > i_version > > > > > fixes some cases without introducing any new ones or > > > > > regressing > > > > > performance after a crash. 
> > > > > > > > > > If we subsequently wanted to close those remaining holes, I > > > > > think > > > > > we'd > > > > > need the change attribute increment to be seen as atomic with > > > > > respect > > > > > to > > > > > its associated change, both to clients and (separately) on > > > > > disk. > > > > > (That > > > > > would still allow the change attribute to go backwards after > > > > > a > > > > > crash, > > > > > to > > > > > the value it held as of the on-disk state of the file. I > > > > > think > > > > > clients > > > > > should be able to deal with that case.) > > > > > > > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > > > > > > > > > If you're not going to meet the minimum bar of data integrity, > > > > then > > > > this whole exercise is just a massive waste of everyone's time. > > > > The > > > > answer then going forward is just to recommend never using > > > > Linux as > > > > an > > > > NFS server. Makes my life much easier, because I no longer have > > > > to > > > > debug any of the issues. > > > > > > > > > > > > > > To be clear, you believe any scheme that would allow the client > > > to > > > see > > > an old change attr after a crash is insufficient? > > > > > > > Correct. If a NFSv4 client or userspace application cannot trust > > that > > it will always see a change to the change attribute value when the > > file > > data changes, then you will eventually see data corruption due to > > the > > cached data no longer matching the stored data. > > > > A false positive update of the change attribute (i.e. a case where > > the > > change attribute changes despite the data/metadata staying the > > same) is > > not desirable because it causes performance issues, but false > > negatives > > are far worse because they mean your data backup, cache, etc... are > > not > > consistent. Applications that have strong consistency requirements > > will > > have no option but to revalidate by always reading the entire file > > data > > + metadata. > > > > > The only way I can see to fix that (at least with only a crash > > > counter) > > > would be to factor it in at presentation time like Neil > > > suggested. > > > Basically we'd just mask off the top 16 bits and plop the crash > > > counter > > > in there before presenting it. > > > > > > In principle, I suppose we could do that at the nfsd level as > > > well > > > (and > > > that might be the simplest way to fix this). We probably wouldn't > > > be > > > able to advertise a change attr type of MONOTONIC with this > > > scheme > > > though. > > > > Why would you want to limit the crash counter to 16 bits? > > > > To leave more room for the "real" counter. Otherwise, an inode that > gets > frequent writes after a long period of no crashes could experience > the > counter wrap. > > IOW, we have 63 bits to play with. Whatever part we dedicate to the > crash counter will not be available for the actual version counter. > > I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a > different one. What is the expectation when you have an unclean shutdown or crash? Do all change attribute values get updated to reflect the new crash counter value, or only some? If the answer is that 'all values change', then why store the crash counter in the inode at all? Why not just add it as an offset when you're generating the user-visible change attribute? i.e. 
statx.change_attr = inode->i_version + (crash counter * offset) (where offset is chosen to be larger than the max number of inode->i_version updates that could get lost by an inode in a crash). Presumably that offset could be significantly smaller than 2^63...
On Thu, 2022-09-15 at 19:03 +0000, Trond Myklebust wrote: > On Thu, 2022-09-15 at 14:11 -0400, Jeff Layton wrote: > > On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote: > > > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote: > > > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote: > > > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown > > > > > > > > wrote: > > > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > > > > > > > > > The machine crashes and comes back up, and we get a > > > > > > > > > > query > > > > > > > > > > for > > > > > > > > > > i_version > > > > > > > > > > and it comes back as X. Fine, it's an old version. > > > > > > > > > > Now > > > > > > > > > > there > > > > > > > > > > is a write. > > > > > > > > > > What do we do to ensure that the new value doesn't > > > > > > > > > > collide > > > > > > > > > > with X+1? > > > > > > > > > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > > > > > > > > > How is it "Fine" to see an old version? > > > > > > > > > The file could have changed without the version > > > > > > > > > changing. > > > > > > > > > And I thought one of the goals of the crash-count was > > > > > > > > > to be > > > > > > > > > able to > > > > > > > > > provide a monotonic change id. > > > > > > > > > > > > > > > > I was still mainly thinking about how to provide reliable > > > > > > > > close- > > > > > > > > to-open > > > > > > > > semantics between NFS clients. In the case the writer > > > > > > > > was an > > > > > > > > NFS > > > > > > > > client, it wasn't done writing (or it would have > > > > > > > > COMMITted), > > > > > > > > so > > > > > > > > those > > > > > > > > writes will come in and bump the change attribute soon, > > > > > > > > and > > > > > > > > as > > > > > > > > long as > > > > > > > > we avoid the small chance of reusing an old change > > > > > > > > attribute, > > > > > > > > we're OK, > > > > > > > > and I think it'd even still be OK to advertise > > > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > > > > > > > > > You seem to be assuming that the client doesn't crash at > > > > > > > the > > > > > > > same > > > > > > > time > > > > > > > as the server (maybe they are both VMs on a host that lost > > > > > > > power...) > > > > > > > > > > > > > > If client A reads and caches, client B writes, the server > > > > > > > crashes > > > > > > > after > > > > > > > writing some data (to already allocated space so no inode > > > > > > > update > > > > > > > needed) > > > > > > > but before writing the new i_version, then client B > > > > > > > crashes. > > > > > > > When server comes back the i_version will be unchanged but > > > > > > > the > > > > > > > data > > > > > > > has > > > > > > > changed. Client A will cache old data indefinitely... > > > > > > > > > > > > I guess I assume that if all we're promising is close-to- > > > > > > open, > > > > > > then a > > > > > > client isn't allowed to trust its cache in that situation. > > > > > > Maybe > > > > > > that's > > > > > > an overly draconian interpretation of close-to-open. > > > > > > > > > > > > Also, I'm trying to think about how to improve things > > > > > > incrementally. 
> > > > > > Incorporating something like a crash count into the on-disk > > > > > > i_version > > > > > > fixes some cases without introducing any new ones or > > > > > > regressing > > > > > > performance after a crash. > > > > > > > > > > > > If we subsequently wanted to close those remaining holes, I > > > > > > think > > > > > > we'd > > > > > > need the change attribute increment to be seen as atomic with > > > > > > respect > > > > > > to > > > > > > its associated change, both to clients and (separately) on > > > > > > disk. > > > > > > (That > > > > > > would still allow the change attribute to go backwards after > > > > > > a > > > > > > crash, > > > > > > to > > > > > > the value it held as of the on-disk state of the file. I > > > > > > think > > > > > > clients > > > > > > should be able to deal with that case.) > > > > > > > > > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > > > > > > > > > > > > If you're not going to meet the minimum bar of data integrity, > > > > > then > > > > > this whole exercise is just a massive waste of everyone's time. > > > > > The > > > > > answer then going forward is just to recommend never using > > > > > Linux as > > > > > an > > > > > NFS server. Makes my life much easier, because I no longer have > > > > > to > > > > > debug any of the issues. > > > > > > > > > > > > > > > > > > To be clear, you believe any scheme that would allow the client > > > > to > > > > see > > > > an old change attr after a crash is insufficient? > > > > > > > > > > Correct. If a NFSv4 client or userspace application cannot trust > > > that > > > it will always see a change to the change attribute value when the > > > file > > > data changes, then you will eventually see data corruption due to > > > the > > > cached data no longer matching the stored data. > > > > > > A false positive update of the change attribute (i.e. a case where > > > the > > > change attribute changes despite the data/metadata staying the > > > same) is > > > not desirable because it causes performance issues, but false > > > negatives > > > are far worse because they mean your data backup, cache, etc... are > > > not > > > consistent. Applications that have strong consistency requirements > > > will > > > have no option but to revalidate by always reading the entire file > > > data > > > + metadata. > > > > > > > The only way I can see to fix that (at least with only a crash > > > > counter) > > > > would be to factor it in at presentation time like Neil > > > > suggested. > > > > Basically we'd just mask off the top 16 bits and plop the crash > > > > counter > > > > in there before presenting it. > > > > > > > > In principle, I suppose we could do that at the nfsd level as > > > > well > > > > (and > > > > that might be the simplest way to fix this). We probably wouldn't > > > > be > > > > able to advertise a change attr type of MONOTONIC with this > > > > scheme > > > > though. > > > > > > Why would you want to limit the crash counter to 16 bits? > > > > > > > To leave more room for the "real" counter. Otherwise, an inode that > > gets > > frequent writes after a long period of no crashes could experience > > the > > counter wrap. > > > > IOW, we have 63 bits to play with. Whatever part we dedicate to the > > crash counter will not be available for the actual version counter. > > > > I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a > > different one. > > > What is the expectation when you have an unclean shutdown or crash? 
Do > all change attribute values get updated to reflect the new crash > counter value, or only some? > > If the answer is that 'all values change', then why store the crash > counter in the inode at all? Why not just add it as an offset when > you're generating the user-visible change attribute? > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > (where offset is chosen to be larger than the max number of inode->i_version updates that could get lost by an inode in a crash). > > Presumably that offset could be significantly smaller than 2^63... > Yes, if we plan to ensure that all the change attrs change after a crash, we can do that. So what would make sense for an offset? Maybe 2**12? One would hope that there wouldn't be more than 4k increments before one of them made it to disk. OTOH, maybe that can happen with teeny-tiny writes. If we want to leave this up to the filesystem, I guess we could just add a new struct super_block.s_version_offset field and let the filesystem precompute that value and set it at mount time. Then we can just add that in after querying i_version.
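For illustration, the offset variant might look something like the following. Note that s_version_offset does not exist in struct super_block today, so both the field and the helper are assumptions sketched from the suggestion above:

#include <linux/fs.h>
#include <linux/iversion.h>

/*
 * Hypothetical: the filesystem computes sb->s_version_offset once at mount
 * time (e.g. crash_counter * 2^16, per the discussion above), and the value
 * handed to statx/NFSD is simply the queried counter plus that offset.
 */
static u64 example_query_change_attr(struct inode *inode)
{
	return inode_query_iversion(inode) + inode->i_sb->s_version_offset;
}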
On Fri, 16 Sep 2022, Jeff Layton wrote: > On Thu, 2022-09-15 at 19:03 +0000, Trond Myklebust wrote: > > On Thu, 2022-09-15 at 14:11 -0400, Jeff Layton wrote: > > > On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote: > > > > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote: > > > > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote: > > > > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > > > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown > > > > > > > > > wrote: > > > > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > > > > > > > > > > > The machine crashes and comes back up, and we get a > > > > > > > > > > > query > > > > > > > > > > > for > > > > > > > > > > > i_version > > > > > > > > > > > and it comes back as X. Fine, it's an old version. > > > > > > > > > > > Now > > > > > > > > > > > there > > > > > > > > > > > is a write. > > > > > > > > > > > What do we do to ensure that the new value doesn't > > > > > > > > > > > collide > > > > > > > > > > > with X+1? > > > > > > > > > > > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > > > > > > > > > > > How is it "Fine" to see an old version? > > > > > > > > > > The file could have changed without the version > > > > > > > > > > changing. > > > > > > > > > > And I thought one of the goals of the crash-count was > > > > > > > > > > to be > > > > > > > > > > able to > > > > > > > > > > provide a monotonic change id. > > > > > > > > > > > > > > > > > > I was still mainly thinking about how to provide reliable > > > > > > > > > close- > > > > > > > > > to-open > > > > > > > > > semantics between NFS clients. In the case the writer > > > > > > > > > was an > > > > > > > > > NFS > > > > > > > > > client, it wasn't done writing (or it would have > > > > > > > > > COMMITted), > > > > > > > > > so > > > > > > > > > those > > > > > > > > > writes will come in and bump the change attribute soon, > > > > > > > > > and > > > > > > > > > as > > > > > > > > > long as > > > > > > > > > we avoid the small chance of reusing an old change > > > > > > > > > attribute, > > > > > > > > > we're OK, > > > > > > > > > and I think it'd even still be OK to advertise > > > > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > > > > > > > > > > > You seem to be assuming that the client doesn't crash at > > > > > > > > the > > > > > > > > same > > > > > > > > time > > > > > > > > as the server (maybe they are both VMs on a host that lost > > > > > > > > power...) > > > > > > > > > > > > > > > > If client A reads and caches, client B writes, the server > > > > > > > > crashes > > > > > > > > after > > > > > > > > writing some data (to already allocated space so no inode > > > > > > > > update > > > > > > > > needed) > > > > > > > > but before writing the new i_version, then client B > > > > > > > > crashes. > > > > > > > > When server comes back the i_version will be unchanged but > > > > > > > > the > > > > > > > > data > > > > > > > > has > > > > > > > > changed. Client A will cache old data indefinitely... > > > > > > > > > > > > > > I guess I assume that if all we're promising is close-to- > > > > > > > open, > > > > > > > then a > > > > > > > client isn't allowed to trust its cache in that situation. > > > > > > > Maybe > > > > > > > that's > > > > > > > an overly draconian interpretation of close-to-open. 
> > > > > > > > > > > > > > Also, I'm trying to think about how to improve things > > > > > > > incrementally. > > > > > > > Incorporating something like a crash count into the on-disk > > > > > > > i_version > > > > > > > fixes some cases without introducing any new ones or > > > > > > > regressing > > > > > > > performance after a crash. > > > > > > > > > > > > > > If we subsequently wanted to close those remaining holes, I > > > > > > > think > > > > > > > we'd > > > > > > > need the change attribute increment to be seen as atomic with > > > > > > > respect > > > > > > > to > > > > > > > its associated change, both to clients and (separately) on > > > > > > > disk. > > > > > > > (That > > > > > > > would still allow the change attribute to go backwards after > > > > > > > a > > > > > > > crash, > > > > > > > to > > > > > > > the value it held as of the on-disk state of the file. I > > > > > > > think > > > > > > > clients > > > > > > > should be able to deal with that case.) > > > > > > > > > > > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > > > > > > > > > > > > > > > If you're not going to meet the minimum bar of data integrity, > > > > > > then > > > > > > this whole exercise is just a massive waste of everyone's time. > > > > > > The > > > > > > answer then going forward is just to recommend never using > > > > > > Linux as > > > > > > an > > > > > > NFS server. Makes my life much easier, because I no longer have > > > > > > to > > > > > > debug any of the issues. > > > > > > > > > > > > > > > > > > > > > > To be clear, you believe any scheme that would allow the client > > > > > to > > > > > see > > > > > an old change attr after a crash is insufficient? > > > > > > > > > > > > > Correct. If a NFSv4 client or userspace application cannot trust > > > > that > > > > it will always see a change to the change attribute value when the > > > > file > > > > data changes, then you will eventually see data corruption due to > > > > the > > > > cached data no longer matching the stored data. > > > > > > > > A false positive update of the change attribute (i.e. a case where > > > > the > > > > change attribute changes despite the data/metadata staying the > > > > same) is > > > > not desirable because it causes performance issues, but false > > > > negatives > > > > are far worse because they mean your data backup, cache, etc... are > > > > not > > > > consistent. Applications that have strong consistency requirements > > > > will > > > > have no option but to revalidate by always reading the entire file > > > > data > > > > + metadata. > > > > > > > > > The only way I can see to fix that (at least with only a crash > > > > > counter) > > > > > would be to factor it in at presentation time like Neil > > > > > suggested. > > > > > Basically we'd just mask off the top 16 bits and plop the crash > > > > > counter > > > > > in there before presenting it. > > > > > > > > > > In principle, I suppose we could do that at the nfsd level as > > > > > well > > > > > (and > > > > > that might be the simplest way to fix this). We probably wouldn't > > > > > be > > > > > able to advertise a change attr type of MONOTONIC with this > > > > > scheme > > > > > though. > > > > > > > > Why would you want to limit the crash counter to 16 bits? > > > > > > > > > > To leave more room for the "real" counter. Otherwise, an inode that > > > gets > > > frequent writes after a long period of no crashes could experience > > > the > > > counter wrap. > > > > > > IOW, we have 63 bits to play with. 
Whatever part we dedicate to the > > > crash counter will not be available for the actual version counter. > > > > > > I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a > > > different one. > > > > > > What is the expectation when you have an unclean shutdown or crash? Do > > all change attribute values get updated to reflect the new crash > > counter value, or only some? > > > > If the answer is that 'all values change', then why store the crash > > counter in the inode at all? Why not just add it as an offset when > > you're generating the user-visible change attribute? > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > (where offset is chosen to be larger than the max number of inode->i_version updates that could get lost by an inode in a crash). > > > > Presumably that offset could be significantly smaller than 2^63... > > > > > Yes, if we plan to ensure that all the change attrs change after a > crash, we can do that. > > So what would make sense for an offset? Maybe 2**12? One would hope that > there wouldn't be more than 4k increments before one of them made it to > disk. OTOH, maybe that can happen with teeny-tiny writes. Leave it up to the filesystem to decide. The VFS and/or NFSD should not have any part in calculating the i_version. It should be entirely in the filesystem - though support code could be provided if common patterns exist across filesystems. A filesystem *could* decide to ensure the on-disk i_version is updated when the difference between in-memory and on-disk reaches X/2, and add X after an unclean restart. Or it could just choose a large X and hope. Or it could do something else that neither of us has thought of. But PLEASE leave the filesystem in control, do not make it fit with our pre-conceived ideas of what would be easy for it. > > If we want to leave this up to the filesystem, I guess we could just add > a new struct super_block.s_version_offset field and let the filesystem > precompute that value and set it at mount time. Then we can just add > that in after querying i_version. If we are leaving "this up to the filesystem", then we don't add anything to struct super_block and we don't add anything "in after querying i_version". Rather, we "leave this up to the filesystem" and use exactly the i_version that the filesystem provides. We only provide advice as to minimum requirements, preferred behaviours, and possible implementation suggestions. NeilBrown > -- > Jeff Layton <jlayton@kernel.org> >
On Fri, 16 Sep 2022, Jeff Layton wrote: > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > How is it "Fine" to see an old version? > > > > > The file could have changed without the version changing. > > > > > And I thought one of the goals of the crash-count was to be able to > > > > > provide a monotonic change id. > > > > > > > > I was still mainly thinking about how to provide reliable close-to-open > > > > semantics between NFS clients. In the case the writer was an NFS > > > > client, it wasn't done writing (or it would have COMMITted), so those > > > > writes will come in and bump the change attribute soon, and as long as > > > > we avoid the small chance of reusing an old change attribute, we're OK, > > > > and I think it'd even still be OK to advertise > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > You seem to be assuming that the client doesn't crash at the same time > > > as the server (maybe they are both VMs on a host that lost power...) > > > > > > If client A reads and caches, client B writes, the server crashes after > > > writing some data (to already allocated space so no inode update needed) > > > but before writing the new i_version, then client B crashes. > > > When server comes back the i_version will be unchanged but the data has > > > changed. Client A will cache old data indefinitely... > > > > I guess I assume that if all we're promising is close-to-open, then a > > client isn't allowed to trust its cache in that situation. Maybe that's > > an overly draconian interpretation of close-to-open. > > > > Also, I'm trying to think about how to improve things incrementally. > > Incorporating something like a crash count into the on-disk i_version > > fixes some cases without introducing any new ones or regressing > > performance after a crash. > > > > I think we ought to start there. > > > If we subsequently wanted to close those remaining holes, I think we'd > > need the change attribute increment to be seen as atomic with respect to > > its associated change, both to clients and (separately) on disk. (That > > would still allow the change attribute to go backwards after a crash, to > > the value it held as of the on-disk state of the file. I think clients > > should be able to deal with that case.) > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > I think we need to require the filesystem to ensure that the i_version > > > is seen to increase shortly after any change becomes visible in the > > > file, and no later than the moment when the request that initiated the > > > change is acknowledged as being complete. In the case of an unclean > > > restart, any file that is not known to have been unchanged immediately > > > before the crash must have i_version increased. > > > > > > The simplest implementation is to have an unclean-restart counter and to > > > always included this multiplied by some constant X in the reported > > > i_version. 
The filesystem guarantees to record (e.g. to journal > > > at least) the i_version if it comes close to X more than the previous > > > record. The filesystem gets to choose X. > > > > So the question is whether people can live with invalidating all client > > caches after a cache. I don't know. > > > > I assume you mean "after a crash". Yeah, that is pretty nasty. We don't > get perfect crash resilience with incorporating this into the on-disk > value, but I like that better than factoring it in at presentation time. > > That would mean that the servers would end up getting hammered with read > activity after a crash (at least in some environments). I don't think > that would be worth the tradeoff. There's a real benefit to preserving > caches when we can. Would it really mean the server gets hammered? For files and NFSv4, any significant cache should be held on the basis of a delegation, and if the client holds a delegation then it shouldn't be paying attention to i_version. I'm not entirely sure of this. Section 10.2.1 of RFC 5661 seems to suggest that when the client uses CLAIM_DELEG_PREV to reclaim a delegation, it must then return the delegation. However the explanation seems to be mostly about WRITE delegations and immediately flushing cached changes. Do we know if there is a way for the server to say "OK, you have that delegation again" in a way that the client can keep the delegation and continue to ignore i_version? For directories, which cannot be delegated the same way but can still be cached, the issues are different. All directory morphing operations will be journalled by the filesystem so it should be able to keep the i_version up to date. So the (journalling) filesystem should *NOT* add a crash-count to the i_version for directories even if it does for files. NeilBrown > > > > A more complex solution would be to record (similar to the way orphans > > > are recorded) any file which is open for write, and to add X to the > > > i_version for any "dirty" file still recorded during an unclean > > > restart. This would avoid bumping the i_version for read-only files. > > > > Is that practical? Working out the performance tradeoffs sounds like a > > project. > > > > > > > There may be other solutions, but we should leave that up to the > > > filesystem. Each filesystem might choose something different. > > > > Sure. > > > > Agreed here too. I think we need to allow for some flexibility here. > > Here's what I'm thinking: > > We'll carve out the upper 16 bits in the i_version counter to be the > crash counter field. That gives us 8k crashes before we have to worry > about collisions. Hopefully the remaining 47 bits of counter will be > plenty given that we don't increment it when it's not being queried or > nothing else changes. (Can we mitigate wrapping here somehow?) > > The easiest way to do this would be to add a u16 s_crash_counter to > struct super_block. We'd initialize that to 0, and the filesystem could > fill that value out at mount time. > > Then inode_maybe_inc_iversion can just shift the s_crash_counter that > left by 24 bits and and plop it into the top of the value we're > preparing to cmpxchg into place. > > This is backward compatible too, at least for i_version counter values > that are <2^47. With anything larger, we might end up with something > going backward and a possible collision, but it's (hopefully) a small > risk. > > -- > Jeff Layton <jlayton@kernel.org> >
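For reference, a userspace model of the carve-out Jeff describes, under a few stated assumptions: the 16-bit crash counter (room for 2^16 crash epochs) sits in the top 16 bits of the 64-bit value, which means a 48-bit shift; the lowest bit stays reserved as the "queried" flag as in the current include/linux/iversion.h; and the remaining 47 bits carry the change counter. The names below are illustrative only, not existing kernel code.

#include <stdint.h>

#define I_VERSION_QUERIED       (1ULL << 0)     /* low bit, as in iversion.h */
#define CRASH_SHIFT             48              /* 16-bit crash counter in the top bits */

/*
 * Build the raw value that an inode_maybe_inc_iversion()-style update
 * would prepare to cmpxchg into place: keep the queried flag and the
 * 47-bit counter, and plop the per-superblock crash counter on top.
 */
static uint64_t fold_crash_counter(uint64_t raw, uint16_t s_crash_counter)
{
        uint64_t counter_and_flag = raw & ((1ULL << CRASH_SHIFT) - 1);

        return ((uint64_t)s_crash_counter << CRASH_SHIFT) | counter_and_flag;
}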
On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > If the answer is that 'all values change', then why store the crash > > > counter in the inode at all? Why not just add it as an offset when > > > you're generating the user-visible change attribute? > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) I had suggested just hashing the crash counter with the file system's on-disk i_version number, which is essentially what you are suggested. > > Yes, if we plan to ensure that all the change attrs change after a > > crash, we can do that. > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > there wouldn't be more than 4k increments before one of them made it to > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > have not have part in calculating the i_version. It should be entirely > in the filesystem - though support code could be provided if common > patterns exist across filesystems. Oh, *heck* no. This parameter is for the NFS implementation to decide, because it's NFS's caching algorithms which are at stake here. As a the file system maintainer, I had offered to make an on-disk "crash counter" which would get updated when the journal had gotten replayed, in addition to the on-disk i_version number. This will be available for the Linux implementation of NFSD to use, but that's up to *you* to decide how you want to use them. I was perfectly happy with hashing the crash counter and the i_version because I had assumed that not *that* much stuff was going to be cached, and so invalidating all of the caches in the unusual case where there was a crash was acceptable. After all it's a !@#?!@ cache. Caches sometimmes get invalidated. "That is the order of things." (as Ramata'Klan once said in "Rocks and Shoals") But if people expect that multiple TB's of data is going to be stored; that cache invalidation is unacceptable; and that a itsy-weeny chance of false negative failures which might cause data corruption might be acceptable tradeoff, hey, that's for the system which is providing caching semantics to determine. PLEASE don't put this tradeoff on the file system authors; I would much prefer to leave this tradeoff in the hands of the system which is trying to do the caching. - Ted
On Fri, 2022-09-16 at 08:42 +1000, NeilBrown wrote: > On Fri, 16 Sep 2022, Jeff Layton wrote: > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote: > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote: > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote: > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote: > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote: > > > > > > > > > > > > > > The machine crashes and comes back up, and we get a query for i_version > > > > > > > and it comes back as X. Fine, it's an old version. Now there is a write. > > > > > > > What do we do to ensure that the new value doesn't collide with X+1? > > > > > > > > > > > > (I missed this bit in my earlier reply..) > > > > > > > > > > > > How is it "Fine" to see an old version? > > > > > > The file could have changed without the version changing. > > > > > > And I thought one of the goals of the crash-count was to be able to > > > > > > provide a monotonic change id. > > > > > > > > > > I was still mainly thinking about how to provide reliable close-to-open > > > > > semantics between NFS clients. In the case the writer was an NFS > > > > > client, it wasn't done writing (or it would have COMMITted), so those > > > > > writes will come in and bump the change attribute soon, and as long as > > > > > we avoid the small chance of reusing an old change attribute, we're OK, > > > > > and I think it'd even still be OK to advertise > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR. > > > > > > > > You seem to be assuming that the client doesn't crash at the same time > > > > as the server (maybe they are both VMs on a host that lost power...) > > > > > > > > If client A reads and caches, client B writes, the server crashes after > > > > writing some data (to already allocated space so no inode update needed) > > > > but before writing the new i_version, then client B crashes. > > > > When server comes back the i_version will be unchanged but the data has > > > > changed. Client A will cache old data indefinitely... > > > > > > I guess I assume that if all we're promising is close-to-open, then a > > > client isn't allowed to trust its cache in that situation. Maybe that's > > > an overly draconian interpretation of close-to-open. > > > > > > Also, I'm trying to think about how to improve things incrementally. > > > Incorporating something like a crash count into the on-disk i_version > > > fixes some cases without introducing any new ones or regressing > > > performance after a crash. > > > > > > > I think we ought to start there. > > > > > If we subsequently wanted to close those remaining holes, I think we'd > > > need the change attribute increment to be seen as atomic with respect to > > > its associated change, both to clients and (separately) on disk. (That > > > would still allow the change attribute to go backwards after a crash, to > > > the value it held as of the on-disk state of the file. I think clients > > > should be able to deal with that case.) > > > > > > But, I don't know, maybe a bigger hammer would be OK: > > > > > > > I think we need to require the filesystem to ensure that the i_version > > > > is seen to increase shortly after any change becomes visible in the > > > > file, and no later than the moment when the request that initiated the > > > > change is acknowledged as being complete. In the case of an unclean > > > > restart, any file that is not known to have been unchanged immediately > > > > before the crash must have i_version increased. 
> > > > > > > > The simplest implementation is to have an unclean-restart counter and to > > > > always included this multiplied by some constant X in the reported > > > > i_version. The filesystem guarantees to record (e.g. to journal > > > > at least) the i_version if it comes close to X more than the previous > > > > record. The filesystem gets to choose X. > > > > > > So the question is whether people can live with invalidating all client > > > caches after a cache. I don't know. > > > > > > > I assume you mean "after a crash". Yeah, that is pretty nasty. We don't > > get perfect crash resilience with incorporating this into the on-disk > > value, but I like that better than factoring it in at presentation time. > > > > That would mean that the servers would end up getting hammered with read > > activity after a crash (at least in some environments). I don't think > > that would be worth the tradeoff. There's a real benefit to preserving > > caches when we can. > > Would it really mean the server gets hammered? > Traditionally, yes. That was the rationale for fscache, after all. Particularly in large renderfarms, when rebooting a large swath of client machines, they end up with blank caches and when they come up they hammer the server with READs. We'll be back to that behavior after a crash with this scheme, since fscache uses the change attribute to determine cache validity. I guess that's unavoidable for now. > For files and NFSv4, any significant cache should be held on the basis > of a delegation, and if the client holds a delegation then it shouldn't > be paying attention to i_version. > > I'm not entirely sure of this. Section 10.2.1 of RFC 5661 seems to > suggest that when the client uses CLAIM_DELEG_PREV to reclaim a > delegation, it must then return the delegation. However the explanation > seems to be mostly about WRITE delegations and immediately flushing > cached changes. Do we know if there is a way for the server to say "OK, > you have that delegation again" in a way that the client can keep the > delegation and continue to ignore i_version? > Delegations may change that calculus. In general I've noticed that the client tends to ignore attribute cache changes when it has a delegation. > For directories, which cannot be delegated the same way but can still be > cached, the issues are different. All directory morphing operations > will be journalled by the filesystem so it should be able to keep the > i_version up to date. So the (journalling) filesystem should *NOT* add > a crash-count to the i_version for directories even if it does for files. > Interesting and good point. We should be able to make that distinction and just mix in the crash counter for regular files. > > > > > > > > A more complex solution would be to record (similar to the way orphans > > > > are recorded) any file which is open for write, and to add X to the > > > > i_version for any "dirty" file still recorded during an unclean > > > > restart. This would avoid bumping the i_version for read-only files. > > > > > > Is that practical? Working out the performance tradeoffs sounds like a > > > project. > > > > > > > > > > There may be other solutions, but we should leave that up to the > > > > filesystem. Each filesystem might choose something different. > > > > > > Sure. > > > > > > > Agreed here too. I think we need to allow for some flexibility here. > > > > Here's what I'm thinking: > > > > We'll carve out the upper 16 bits in the i_version counter to be the > > crash counter field. 
That gives us 8k crashes before we have to worry > > about collisions. Hopefully the remaining 47 bits of counter will be > > plenty given that we don't increment it when it's not being queried or > > nothing else changes. (Can we mitigate wrapping here somehow?) > > > > The easiest way to do this would be to add a u16 s_crash_counter to > > struct super_block. We'd initialize that to 0, and the filesystem could > > fill that value out at mount time. > > > > Then inode_maybe_inc_iversion can just shift the s_crash_counter that > > left by 24 bits and and plop it into the top of the value we're > > preparing to cmpxchg into place. > > > > This is backward compatible too, at least for i_version counter values > > that are <2^47. With anything larger, we might end up with something > > going backward and a possible collision, but it's (hopefully) a small > > risk. > > > > -- > > Jeff Layton <jlayton@kernel.org> > >
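The distinction drawn above between directories and regular files could look like the following sketch, assuming a per-superblock salt derived from the crash counter; visible_change_attr() is a hypothetical helper, not existing kernel code.

#include <stdint.h>
#include <sys/stat.h>

/*
 * Directory-morphing operations are journalled, so a directory's
 * i_version can stay exact; only regular files get the crash salt.
 */
static uint64_t visible_change_attr(mode_t mode, uint64_t i_version,
                                    uint64_t crash_salt)
{
        if (S_ISDIR(mode))
                return i_version;

        return i_version + crash_salt;
}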
On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > If the answer is that 'all values change', then why store the crash > > > > counter in the inode at all? Why not just add it as an offset when > > > > you're generating the user-visible change attribute? > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > I had suggested just hashing the crash counter with the file system's > on-disk i_version number, which is essentially what you are suggested. > > > > Yes, if we plan to ensure that all the change attrs change after a > > > crash, we can do that. > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > there wouldn't be more than 4k increments before one of them made it to > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > have not have part in calculating the i_version. It should be entirely > > in the filesystem - though support code could be provided if common > > patterns exist across filesystems. > > Oh, *heck* no. This parameter is for the NFS implementation to > decide, because it's NFS's caching algorithms which are at stake here. > > As a the file system maintainer, I had offered to make an on-disk > "crash counter" which would get updated when the journal had gotten > replayed, in addition to the on-disk i_version number. This will be > available for the Linux implementation of NFSD to use, but that's up > to *you* to decide how you want to use them. > > I was perfectly happy with hashing the crash counter and the i_version > because I had assumed that not *that* much stuff was going to be > cached, and so invalidating all of the caches in the unusual case > where there was a crash was acceptable. After all it's a !@#?!@ > cache. Caches sometimmes get invalidated. "That is the order of > things." (as Ramata'Klan once said in "Rocks and Shoals") > > But if people expect that multiple TB's of data is going to be stored; > that cache invalidation is unacceptable; and that a itsy-weeny chance > of false negative failures which might cause data corruption might be > acceptable tradeoff, hey, that's for the system which is providing > caching semantics to determine. > > PLEASE don't put this tradeoff on the file system authors; I would > much prefer to leave this tradeoff in the hands of the system which is > trying to do the caching. > Yeah, if we were designing this from scratch, I might agree with leaving more up to the filesystem, but the existing users all have pretty much the same needs. I'm going to plan to try to keep most of this in the common infrastructure defined in iversion.h. Ted, for the ext4 crash counter, what wordsize were you thinking? I doubt we'll be able to use much more than 32 bits so a larger integer is probably not worthwhile. There are several holes in struct super_block (at least on x86_64), so adding this field to the generic structure needn't grow it.
On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote: > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > > If the answer is that 'all values change', then why store the crash > > > > > counter in the inode at all? Why not just add it as an offset when > > > > > you're generating the user-visible change attribute? > > > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > I had suggested just hashing the crash counter with the file system's > > on-disk i_version number, which is essentially what you are suggested. > > > > > > Yes, if we plan to ensure that all the change attrs change after a > > > > crash, we can do that. > > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > > there wouldn't be more than 4k increments before one of them made it to > > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > > have not have part in calculating the i_version. It should be entirely > > > in the filesystem - though support code could be provided if common > > > patterns exist across filesystems. > > > > Oh, *heck* no. This parameter is for the NFS implementation to > > decide, because it's NFS's caching algorithms which are at stake here. > > > > As a the file system maintainer, I had offered to make an on-disk > > "crash counter" which would get updated when the journal had gotten > > replayed, in addition to the on-disk i_version number. This will be > > available for the Linux implementation of NFSD to use, but that's up > > to *you* to decide how you want to use them. > > > > I was perfectly happy with hashing the crash counter and the i_version > > because I had assumed that not *that* much stuff was going to be > > cached, and so invalidating all of the caches in the unusual case > > where there was a crash was acceptable. After all it's a !@#?!@ > > cache. Caches sometimmes get invalidated. "That is the order of > > things." (as Ramata'Klan once said in "Rocks and Shoals") > > > > But if people expect that multiple TB's of data is going to be stored; > > that cache invalidation is unacceptable; and that a itsy-weeny chance > > of false negative failures which might cause data corruption might be > > acceptable tradeoff, hey, that's for the system which is providing > > caching semantics to determine. > > > > PLEASE don't put this tradeoff on the file system authors; I would > > much prefer to leave this tradeoff in the hands of the system which is > > trying to do the caching. > > > > Yeah, if we were designing this from scratch, I might agree with leaving > more up to the filesystem, but the existing users all have pretty much > the same needs. I'm going to plan to try to keep most of this in the > common infrastructure defined in iversion.h. > > Ted, for the ext4 crash counter, what wordsize were you thinking? I > doubt we'll be able to use much more than 32 bits so a larger integer is > probably not worthwhile. There are several holes in struct super_block > (at least on x86_64), so adding this field to the generic structure > needn't grow it. That said, now that I've taken a swipe at implementing this, I need more information than just the crash counter. 
We need to multiply the crash counter with a reasonable estimate of the maximum number of individual writes that could occur between an i_version being incremented and that value making it to the backing store. IOW, given a write that bumps the i_version to X, how many more write calls could race in before X makes it to the platter? I took a SWAG and said 4k in an earlier email, but I don't really have a way to know, and that could vary wildly with different filesystems and storage. What I'd like to see is this in struct super_block: u32 s_version_offset; ...and then individual filesystems can calculate: crash_counter * max_number_of_writes and put the correct value in there at mount time.
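A sketch of that proposal, with the caveat that s_version_offset does not exist today and the 4096 bound is the SWAG from above rather than a real limit:

#include <stdint.h>

/* filesystem's estimate of i_version bumps that can be lost in a crash */
#define MAX_UNLOGGED_BUMPS      4096

struct super_block_sketch {
        uint32_t s_version_offset;      /* proposed field, set once at mount */
};

static void sb_set_version_offset(struct super_block_sketch *sb,
                                  uint32_t crash_counter)
{
        /* done at mount time by the filesystem */
        sb->s_version_offset = crash_counter * MAX_UNLOGGED_BUMPS;
}

/* what a query would then report */
static uint64_t query_change_attr(const struct super_block_sketch *sb,
                                  uint64_t i_version)
{
        return i_version + sb->s_version_offset;
}

Note that with a u32 field the product can overflow once the per-crash bound grows large, which is the objection raised in the reply below.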
On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote: > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote: > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > > > If the answer is that 'all values change', then why store the crash > > > > > > counter in the inode at all? Why not just add it as an offset when > > > > > > you're generating the user-visible change attribute? > > > > > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > > > I had suggested just hashing the crash counter with the file system's > > > on-disk i_version number, which is essentially what you are suggested. > > > > > > > > Yes, if we plan to ensure that all the change attrs change after a > > > > > crash, we can do that. > > > > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > > > there wouldn't be more than 4k increments before one of them made it to > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > > > have not have part in calculating the i_version. It should be entirely > > > > in the filesystem - though support code could be provided if common > > > > patterns exist across filesystems. > > > > > > Oh, *heck* no. This parameter is for the NFS implementation to > > > decide, because it's NFS's caching algorithms which are at stake here. > > > > > > As a the file system maintainer, I had offered to make an on-disk > > > "crash counter" which would get updated when the journal had gotten > > > replayed, in addition to the on-disk i_version number. This will be > > > available for the Linux implementation of NFSD to use, but that's up > > > to *you* to decide how you want to use them. > > > > > > I was perfectly happy with hashing the crash counter and the i_version > > > because I had assumed that not *that* much stuff was going to be > > > cached, and so invalidating all of the caches in the unusual case > > > where there was a crash was acceptable. After all it's a !@#?!@ > > > cache. Caches sometimmes get invalidated. "That is the order of > > > things." (as Ramata'Klan once said in "Rocks and Shoals") > > > > > > But if people expect that multiple TB's of data is going to be stored; > > > that cache invalidation is unacceptable; and that a itsy-weeny chance > > > of false negative failures which might cause data corruption might be > > > acceptable tradeoff, hey, that's for the system which is providing > > > caching semantics to determine. > > > > > > PLEASE don't put this tradeoff on the file system authors; I would > > > much prefer to leave this tradeoff in the hands of the system which is > > > trying to do the caching. > > > > > > > Yeah, if we were designing this from scratch, I might agree with leaving > > more up to the filesystem, but the existing users all have pretty much > > the same needs. I'm going to plan to try to keep most of this in the > > common infrastructure defined in iversion.h. > > > > Ted, for the ext4 crash counter, what wordsize were you thinking? I > > doubt we'll be able to use much more than 32 bits so a larger integer is > > probably not worthwhile. There are several holes in struct super_block > > (at least on x86_64), so adding this field to the generic structure > > needn't grow it. > > That said, now that I've taken a swipe at implementing this, I need more > information than just the crash counter. 
We need to multiply the crash > counter with a reasonable estimate of the maximum number of individual > writes that could occur between an i_version being incremented and that > value making it to the backing store. > > IOW, given a write that bumps the i_version to X, how many more write > calls could race in before X makes it to the platter? I took a SWAG and > said 4k in an earlier email, but I don't really have a way to know, and > that could vary wildly with different filesystems and storage. > > What I'd like to see is this in struct super_block: > > u32 s_version_offset; u64 s_version_salt; > ...and then individual filesystems can calculate: > > crash_counter * max_number_of_writes > > and put the correct value in there at mount time. Other filesystems might not have a crash counter but have other information that can be substituted, like a mount counter or a global change sequence number that is guaranteed to increment from one mount to the next. Further, have you thought about what "max number of writes" might be in ten years time? e.g. what happens if a filesysetm as "max number of writes" being greater than 2^32? I mean, we already have machines out there running Linux with 64-128TB of physical RAM, so it's already practical to hold > 2^32 individual writes to a single inode that each bump i_version in memory.... So when we consider this sort of scale, the "crash counter * max writes" scheme largely falls apart because "max writes" is a really large number to begin with. We're going to be stuck with whatever algorithm is decided on for the foreseeable future, so we must recognise that _we've already overrun 32 bit counter schemes_ in terms of tracking "i_version changes in memory vs what we have on disk". Hence I really think that we should be leaving the implementation of the salt value to the individual filesysetms as different filesytsems are aimed at different use cases and so may not necessarily have to all care about the same things (like 2^32 bit max write overruns). All the high level VFS code then needs to do is add the two together: statx.change_attr = inode->i_version + sb->s_version_salt; Cheers, Dave.
On Mon, 2022-09-19 at 09:53 +1000, Dave Chinner wrote: > On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote: > > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote: > > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > > > > If the answer is that 'all values change', then why store the crash > > > > > > > counter in the inode at all? Why not just add it as an offset when > > > > > > > you're generating the user-visible change attribute? > > > > > > > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > > > > > I had suggested just hashing the crash counter with the file system's > > > > on-disk i_version number, which is essentially what you are suggested. > > > > > > > > > > Yes, if we plan to ensure that all the change attrs change after a > > > > > > crash, we can do that. > > > > > > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > > > > there wouldn't be more than 4k increments before one of them made it to > > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > > > > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > > > > have not have part in calculating the i_version. It should be entirely > > > > > in the filesystem - though support code could be provided if common > > > > > patterns exist across filesystems. > > > > > > > > Oh, *heck* no. This parameter is for the NFS implementation to > > > > decide, because it's NFS's caching algorithms which are at stake here. > > > > > > > > As a the file system maintainer, I had offered to make an on-disk > > > > "crash counter" which would get updated when the journal had gotten > > > > replayed, in addition to the on-disk i_version number. This will be > > > > available for the Linux implementation of NFSD to use, but that's up > > > > to *you* to decide how you want to use them. > > > > > > > > I was perfectly happy with hashing the crash counter and the i_version > > > > because I had assumed that not *that* much stuff was going to be > > > > cached, and so invalidating all of the caches in the unusual case > > > > where there was a crash was acceptable. After all it's a !@#?!@ > > > > cache. Caches sometimmes get invalidated. "That is the order of > > > > things." (as Ramata'Klan once said in "Rocks and Shoals") > > > > > > > > But if people expect that multiple TB's of data is going to be stored; > > > > that cache invalidation is unacceptable; and that a itsy-weeny chance > > > > of false negative failures which might cause data corruption might be > > > > acceptable tradeoff, hey, that's for the system which is providing > > > > caching semantics to determine. > > > > > > > > PLEASE don't put this tradeoff on the file system authors; I would > > > > much prefer to leave this tradeoff in the hands of the system which is > > > > trying to do the caching. > > > > > > > > > > Yeah, if we were designing this from scratch, I might agree with leaving > > > more up to the filesystem, but the existing users all have pretty much > > > the same needs. I'm going to plan to try to keep most of this in the > > > common infrastructure defined in iversion.h. > > > > > > Ted, for the ext4 crash counter, what wordsize were you thinking? I > > > doubt we'll be able to use much more than 32 bits so a larger integer is > > > probably not worthwhile. 
There are several holes in struct super_block > > > (at least on x86_64), so adding this field to the generic structure > > > needn't grow it. > > > > That said, now that I've taken a swipe at implementing this, I need more > > information than just the crash counter. We need to multiply the crash > > counter with a reasonable estimate of the maximum number of individual > > writes that could occur between an i_version being incremented and that > > value making it to the backing store. > > > > IOW, given a write that bumps the i_version to X, how many more write > > calls could race in before X makes it to the platter? I took a SWAG and > > said 4k in an earlier email, but I don't really have a way to know, and > > that could vary wildly with different filesystems and storage. > > > > What I'd like to see is this in struct super_block: > > > > u32 s_version_offset; > > u64 s_version_salt; > IDK...it _is_ an offset since we're folding it in with addition, and it has a real meaning. Filesystems do need to be cognizant of that fact, I think. Also does anyone have a preference on doing this vs. a get_version_salt or get_version_offset sb operation? I figured the value should be mostly static so it'd be nice to avoid an operation for it. > > ...and then individual filesystems can calculate: > > > > crash_counter * max_number_of_writes > > > > and put the correct value in there at mount time. > > Other filesystems might not have a crash counter but have other > information that can be substituted, like a mount counter or a > global change sequence number that is guaranteed to increment from > one mount to the next. > The problem there is that you're going to cause the invalidation of all of the NFS client's cached regular files, even on clean server reboots. That's not a desirable outcome. > Further, have you thought about what "max number of writes" might > be in ten years time? e.g. what happens if a filesysetm as "max > number of writes" being greater than 2^32? I mean, we already have > machines out there running Linux with 64-128TB of physical RAM, so > it's already practical to hold > 2^32 individual writes to a single > inode that each bump i_version in memory.... > So when we consider this sort of scale, the "crash counter * max > writes" scheme largely falls apart because "max writes" is a really > large number to begin with. We're going to be stuck with whatever > algorithm is decided on for the foreseeable future, so we must > recognise that _we've already overrun 32 bit counter schemes_ in > terms of tracking "i_version changes in memory vs what we have on > disk". > > Hence I really think that we should be leaving the implementation of > the salt value to the individual filesysetms as different > filesytsems are aimed at different use cases and so may not > necessarily have to all care about the same things (like 2^32 bit > max write overruns). All the high level VFS code then needs to do > is add the two together: > > statx.change_attr = inode->i_version + sb->s_version_salt; > Yeah, I have thought about that. I was really hoping that file systems wouldn't leave so many ephemeral changes lying around before logging something. It's actually not as bad as it sounds. You'd need that number of inode changes in memory + queries of i_version, alternating. When there are no queries, nothing changes. But, the number of queries is hard to gauge too as it's very dependent on workload, hardware, etc. If the sky really is the limit on unlogged inode changes, then what do you suggest? 
One idea: We could try to kick off a write_inode in the background when the i_version gets halfway to the limit. Eventually the NFS server could just return NFS4ERR_DELAY on a GETATTR if it looked like the reported version was going to cross the threshold. It'd be ugly, but hopefully wouldn't happen much if things are tuned well.

Tracking that info might be expensive though. We'd need at least another u64 field in struct inode for the latest on-disk version. Maybe we can keep that in the fs-specific part of the inode somehow so we don't need to grow generic struct inode?
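A rough model of the two thresholds described above, with the caveat that the tracking of the last on-disk version, the limit, and both helpers are hypothetical; nothing like this exists in the kernel today.

#include <stdbool.h>
#include <stdint.h>

struct iversion_state {
        uint64_t in_memory;     /* current i_version */
        uint64_t on_disk;       /* last value known to have hit the backing store */
        uint64_t limit;         /* max bumps that may safely be lost in a crash */
};

/* true if the caller should schedule a write_inode-style flush */
static bool version_needs_flush(const struct iversion_state *v)
{
        return v->in_memory - v->on_disk >= v->limit / 2;
}

/* true if a GETATTR-style query should be delayed (NFS4ERR_DELAY) */
static bool version_needs_delay(const struct iversion_state *v)
{
        return v->in_memory - v->on_disk >= v->limit;
}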
On Mon, Sep 19, 2022 at 09:13:00AM -0400, Jeff Layton wrote: > On Mon, 2022-09-19 at 09:53 +1000, Dave Chinner wrote: > > On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote: > > > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote: > > > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > > > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > > > > > If the answer is that 'all values change', then why store the crash > > > > > > > > counter in the inode at all? Why not just add it as an offset when > > > > > > > > you're generating the user-visible change attribute? > > > > > > > > > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > > > > > > > I had suggested just hashing the crash counter with the file system's > > > > > on-disk i_version number, which is essentially what you are suggested. > > > > > > > > > > > > Yes, if we plan to ensure that all the change attrs change after a > > > > > > > crash, we can do that. > > > > > > > > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > > > > > there wouldn't be more than 4k increments before one of them made it to > > > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > > > > > > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > > > > > have not have part in calculating the i_version. It should be entirely > > > > > > in the filesystem - though support code could be provided if common > > > > > > patterns exist across filesystems. > > > > > > > > > > Oh, *heck* no. This parameter is for the NFS implementation to > > > > > decide, because it's NFS's caching algorithms which are at stake here. > > > > > > > > > > As a the file system maintainer, I had offered to make an on-disk > > > > > "crash counter" which would get updated when the journal had gotten > > > > > replayed, in addition to the on-disk i_version number. This will be > > > > > available for the Linux implementation of NFSD to use, but that's up > > > > > to *you* to decide how you want to use them. > > > > > > > > > > I was perfectly happy with hashing the crash counter and the i_version > > > > > because I had assumed that not *that* much stuff was going to be > > > > > cached, and so invalidating all of the caches in the unusual case > > > > > where there was a crash was acceptable. After all it's a !@#?!@ > > > > > cache. Caches sometimmes get invalidated. "That is the order of > > > > > things." (as Ramata'Klan once said in "Rocks and Shoals") > > > > > > > > > > But if people expect that multiple TB's of data is going to be stored; > > > > > that cache invalidation is unacceptable; and that a itsy-weeny chance > > > > > of false negative failures which might cause data corruption might be > > > > > acceptable tradeoff, hey, that's for the system which is providing > > > > > caching semantics to determine. > > > > > > > > > > PLEASE don't put this tradeoff on the file system authors; I would > > > > > much prefer to leave this tradeoff in the hands of the system which is > > > > > trying to do the caching. > > > > > > > > > > > > > Yeah, if we were designing this from scratch, I might agree with leaving > > > > more up to the filesystem, but the existing users all have pretty much > > > > the same needs. I'm going to plan to try to keep most of this in the > > > > common infrastructure defined in iversion.h. > > > > > > > > Ted, for the ext4 crash counter, what wordsize were you thinking? 
I > > > > doubt we'll be able to use much more than 32 bits so a larger integer is > > > > probably not worthwhile. There are several holes in struct super_block > > > > (at least on x86_64), so adding this field to the generic structure > > > > needn't grow it. > > > > > > That said, now that I've taken a swipe at implementing this, I need more > > > information than just the crash counter. We need to multiply the crash > > > counter with a reasonable estimate of the maximum number of individual > > > writes that could occur between an i_version being incremented and that > > > value making it to the backing store. > > > > > > IOW, given a write that bumps the i_version to X, how many more write > > > calls could race in before X makes it to the platter? I took a SWAG and > > > said 4k in an earlier email, but I don't really have a way to know, and > > > that could vary wildly with different filesystems and storage. > > > > > > What I'd like to see is this in struct super_block: > > > > > > u32 s_version_offset; > > > > u64 s_version_salt; > > > > IDK...it _is_ an offset since we're folding it in with addition, and it > has a real meaning. Filesystems do need to be cognizant of that fact, I > think. > > Also does anyone have a preference on doing this vs. a get_version_salt > or get_version_offset sb operation? I figured the value should be mostly > static so it'd be nice to avoid an operation for it. > > > > ...and then individual filesystems can calculate: > > > > > > crash_counter * max_number_of_writes > > > > > > and put the correct value in there at mount time. > > > > Other filesystems might not have a crash counter but have other > > information that can be substituted, like a mount counter or a > > global change sequence number that is guaranteed to increment from > > one mount to the next. > > > > The problem there is that you're going to cause the invalidation of all > of the NFS client's cached regular files, even on clean server reboots. > That's not a desirable outcome. Stop saying "anything less than perfect is unacceptible". I *know* that changing the salt on every mount might result in less than perfect results, but the fact is that a -false negative- is a data corruption event, whilst a false positive is not. False positives may not be desirable, but false negatives are *not acceptible at all*. XFS can give you a guarantee of no false negatives right now with no on-disk format changes necessary, but it comes with the downside of false positives. That's not the end of the world, and it gives NFS the functionality it needs immediately and allows us time to add purpose-built on-disk functionality that gives NFS exactly what it wants. The reality is that this purpose-built on-disk change will take years to roll out to production systems, whilst using what we have now is just a kernel patch and upgrade away.... Changing on-disk metadata formats takes time, no matter how simple the change, and this timeframe is not something the NFS server actually controls. But there is a way for the NFS server to define and control it's own on-disk persistent metadata: *extended attributes*. How about we set a "crash" extended attribute on the root of an NFS export when the filesystem is exported, and then remove it when the filesystem is unexported. This gives the NFS server it's own persistent attribute that tells it whether the filesystem was *unexported* cleanly. 
If the exportfs code calls syncfs() before the xattr is removed, then it guarantees that everything the NFS clients have written and modified will be exactly present the next time the filesystem is exported. If the "crash" xattr is present when the filesystem is exported, then it wasn't cleanly synced before it was taken out of service, and so something may have been lost and the "crash counter" needs to be bumped.

Yes, the "crash counter" is held in another xattr, so that it is persistent across crash and mount/unmount cycles. If the crash xattr is present, the NFSD reads, bumps and writes the crash counter xattr, and uses the new value for the life of that export. If the crash xattr is not present, then it just reads the counter xattr and uses it unchanged.

IOWs, the NFS server can define its own on-disk persistent metadata using xattrs, and you don't need local filesystems to be modified at all. You can add the crash epoch into the change attr that is sent to NFS clients without having to change the VFS i_version implementation at all.

This whole problem is solvable entirely within the NFS server code, and we don't need to change local filesystems at all. NFS can control the persistence and format of the xattrs it uses, and it does not need new custom on-disk format changes from every filesystem to support this new application requirement.

At this point, NFS server developers don't need to care what the underlying filesystem format provides - the xattrs provide the crash detection and enumeration the NFS server functionality requires.

-Dave.
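A userspace sketch of that export-time xattr scheme; the xattr names, the u32 counter format, and the helper names are assumptions for illustration, and nothing in nfsd or the exportfs tooling does this today.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/xattr.h>
#include <unistd.h>

#define XATTR_UNCLEAN   "trusted.nfsd.unclean"          /* assumed name */
#define XATTR_CRASHCNT  "trusted.nfsd.crash_counter"    /* assumed name */

/* On export: bump the counter if the previous export never cleaned up. */
static int64_t export_get_crash_epoch(const char *export_root)
{
        uint32_t counter = 0;

        /* missing counter xattr just means "never crashed": stay at 0 */
        getxattr(export_root, XATTR_CRASHCNT, &counter, sizeof(counter));

        if (getxattr(export_root, XATTR_UNCLEAN, NULL, 0) >= 0) {
                /* previous export was not cleanly torn down */
                counter++;
                if (setxattr(export_root, XATTR_CRASHCNT, &counter,
                             sizeof(counter), 0) < 0)
                        return -1;
        }

        /* mark this export as in service until unexport removes it */
        if (setxattr(export_root, XATTR_UNCLEAN, "", 0, 0) < 0)
                return -1;

        return counter;
}

/* On unexport: push everything to stable storage, then declare it clean. */
static int export_mark_clean(const char *export_root)
{
        int fd = open(export_root, O_RDONLY | O_DIRECTORY);

        if (fd < 0)
                return -1;
        syncfs(fd);
        close(fd);

        return removexattr(export_root, XATTR_UNCLEAN);
}

An unclean shutdown leaves the first xattr behind, so the next export finds it and bumps the counter, which is the crash detection described above.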
On Tue, 2022-09-20 at 10:16 +1000, Dave Chinner wrote: > On Mon, Sep 19, 2022 at 09:13:00AM -0400, Jeff Layton wrote: > > On Mon, 2022-09-19 at 09:53 +1000, Dave Chinner wrote: > > > On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote: > > > > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote: > > > > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > > > > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > > > > > > If the answer is that 'all values change', then why store the crash > > > > > > > > > counter in the inode at all? Why not just add it as an offset when > > > > > > > > > you're generating the user-visible change attribute? > > > > > > > > > > > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > > > > > > > > > I had suggested just hashing the crash counter with the file system's > > > > > > on-disk i_version number, which is essentially what you are suggested. > > > > > > > > > > > > > > Yes, if we plan to ensure that all the change attrs change after a > > > > > > > > crash, we can do that. > > > > > > > > > > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > > > > > > there wouldn't be more than 4k increments before one of them made it to > > > > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > > > > > > > > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > > > > > > have not have part in calculating the i_version. It should be entirely > > > > > > > in the filesystem - though support code could be provided if common > > > > > > > patterns exist across filesystems. > > > > > > > > > > > > Oh, *heck* no. This parameter is for the NFS implementation to > > > > > > decide, because it's NFS's caching algorithms which are at stake here. > > > > > > > > > > > > As a the file system maintainer, I had offered to make an on-disk > > > > > > "crash counter" which would get updated when the journal had gotten > > > > > > replayed, in addition to the on-disk i_version number. This will be > > > > > > available for the Linux implementation of NFSD to use, but that's up > > > > > > to *you* to decide how you want to use them. > > > > > > > > > > > > I was perfectly happy with hashing the crash counter and the i_version > > > > > > because I had assumed that not *that* much stuff was going to be > > > > > > cached, and so invalidating all of the caches in the unusual case > > > > > > where there was a crash was acceptable. After all it's a !@#?!@ > > > > > > cache. Caches sometimmes get invalidated. "That is the order of > > > > > > things." (as Ramata'Klan once said in "Rocks and Shoals") > > > > > > > > > > > > But if people expect that multiple TB's of data is going to be stored; > > > > > > that cache invalidation is unacceptable; and that a itsy-weeny chance > > > > > > of false negative failures which might cause data corruption might be > > > > > > acceptable tradeoff, hey, that's for the system which is providing > > > > > > caching semantics to determine. > > > > > > > > > > > > PLEASE don't put this tradeoff on the file system authors; I would > > > > > > much prefer to leave this tradeoff in the hands of the system which is > > > > > > trying to do the caching. > > > > > > > > > > > > > > > > Yeah, if we were designing this from scratch, I might agree with leaving > > > > > more up to the filesystem, but the existing users all have pretty much > > > > > the same needs. 
I'm going to plan to try to keep most of this in the > > > > > common infrastructure defined in iversion.h. > > > > > > > > > > Ted, for the ext4 crash counter, what wordsize were you thinking? I > > > > > doubt we'll be able to use much more than 32 bits so a larger integer is > > > > > probably not worthwhile. There are several holes in struct super_block > > > > > (at least on x86_64), so adding this field to the generic structure > > > > > needn't grow it. > > > > > > > > That said, now that I've taken a swipe at implementing this, I need more > > > > information than just the crash counter. We need to multiply the crash > > > > counter with a reasonable estimate of the maximum number of individual > > > > writes that could occur between an i_version being incremented and that > > > > value making it to the backing store. > > > > > > > > IOW, given a write that bumps the i_version to X, how many more write > > > > calls could race in before X makes it to the platter? I took a SWAG and > > > > said 4k in an earlier email, but I don't really have a way to know, and > > > > that could vary wildly with different filesystems and storage. > > > > > > > > What I'd like to see is this in struct super_block: > > > > > > > > u32 s_version_offset; > > > > > > u64 s_version_salt; > > > > > > > IDK...it _is_ an offset since we're folding it in with addition, and it > > has a real meaning. Filesystems do need to be cognizant of that fact, I > > think. > > > > Also does anyone have a preference on doing this vs. a get_version_salt > > or get_version_offset sb operation? I figured the value should be mostly > > static so it'd be nice to avoid an operation for it. > > > > > > ...and then individual filesystems can calculate: > > > > > > > > crash_counter * max_number_of_writes > > > > > > > > and put the correct value in there at mount time. > > > > > > Other filesystems might not have a crash counter but have other > > > information that can be substituted, like a mount counter or a > > > global change sequence number that is guaranteed to increment from > > > one mount to the next. > > > > > > > The problem there is that you're going to cause the invalidation of all > > of the NFS client's cached regular files, even on clean server reboots. > > That's not a desirable outcome. > > Stop saying "anything less than perfect is unacceptible". I *know* > that changing the salt on every mount might result in less than > perfect results, but the fact is that a -false negative- is a data > corruption event, whilst a false positive is not. False positives > may not be desirable, but false negatives are *not acceptible at > all*. > > XFS can give you a guarantee of no false negatives right now with no > on-disk format changes necessary, but it comes with the downside of > false positives. That's not the end of the world, and it gives NFS > the functionality it needs immediately and allows us time to add > purpose-built on-disk functionality that gives NFS exactly what it > wants. The reality is that this purpose-built on-disk change will > take years to roll out to production systems, whilst using what we > have now is just a kernel patch and upgrade away.... > > Changing on-disk metadata formats takes time, no matter how simple > the change, and this timeframe is not something the NFS server > actually controls. > > But there is a way for the NFS server to define and control it's own > on-disk persistent metadata: *extended attributes*. 
> > How about we set a "crash" extended attribute on the root of an NFS > export when the filesystem is exported, and then remove it when the > filesystem is unexported. > > This gives the NFS server it's own persistent attribute that tells > it whether the filesystem was *unexported* cleanly. If the exportfs > code calls syncfs() before the xattr is removed, then it guarantees > that everything the NFS clients have written and modified will be > exactly present the next time the filesystem is exported. If the > "crash" xattr is present when the filesystem is exported, then it > wasn't cleanly synced before it was taken out of service, and so > something may have been lost and the "crash counter" needs to be > bumped. > > Yes, the "crash counter" is held in another xattr, so that it is > persistent across crash and mount/unmount cycles. If the crash > xattr is present, the NFSD reads, bumps and writes the crash counter > xattr, and uses the new value for the life of that export. If the > crash xattr is not present, then is just reads the counter xattr and > uses it unchanged. > > IOWs, the NFS server can define it's own on-disk persistent metadata > using xattrs, and you don't need local filesystems to be modified at > all. You can add the crash epoch into the change attr that is sent > to NFS clients without having to change the VFS i_version > implementation at all. > > This whole problem is solvable entirely within the NFS server code, > and we don't need to change local filesystems at all. NFS can > control the persistence and format of the xattrs it uses, and it > does not need new custom on-disk format changes from every > filesystem to support this new application requirement. > > At this point, NFS server developers don't need to care what the > underlying filesystem format provides - the xattrs provide the crash > detection and enumeration the NFS server functionality requires. > Doesn't the filesystem already detect when it's been mounted after an unclean shutdown? I'm not sure what good we'll get out of bolting this scheme onto the NFS server, when the filesystem could just as easily give us this info. In any case, the main problem at this point is not so much in detecting when there has been an unclean shutdown, but rather what to do when there is one. We need to to advance the presented change attributes beyond the largest possible one that may have been handed out prior to the crash. How do we determine what that offset should be? Your last email suggested that there really is no limit to the number of i_version bumps that can happen in memory before one of them makes it to disk. What can we do to address that?
On Tue, Sep 20, 2022 at 06:26:05AM -0400, Jeff Layton wrote: > On Tue, 2022-09-20 at 10:16 +1000, Dave Chinner wrote: > > IOWs, the NFS server can define it's own on-disk persistent metadata > > using xattrs, and you don't need local filesystems to be modified at > > all. You can add the crash epoch into the change attr that is sent > > to NFS clients without having to change the VFS i_version > > implementation at all. > > > > This whole problem is solvable entirely within the NFS server code, > > and we don't need to change local filesystems at all. NFS can > > control the persistence and format of the xattrs it uses, and it > > does not need new custom on-disk format changes from every > > filesystem to support this new application requirement. > > > > At this point, NFS server developers don't need to care what the > > underlying filesystem format provides - the xattrs provide the crash > > detection and enumeration the NFS server functionality requires. > > > > Doesn't the filesystem already detect when it's been mounted after an > unclean shutdown? Not every filesystem will be able to guarantee unclean shutdown detection at the next mount. That's the whole problem - NFS developers are asking for something that cannot be provided as generic functionality by individual filesystems, so the NFS server application is going to have to work around any filesytem that cannot provide the information it needs. e.g. ext4 has it journal replayed by the userspace tools prior to mount, so when it then gets mounted by the kernel it's seen as a clean mount. If we shut an XFS filesystem down due to a filesystem corruption or failed IO to the journal code, the kernel might not be able to replay the journal on mount (i.e. it is corrupt). We then run xfs_repair, and that fixes the corruption issue and -cleans the log-. When we next mount the filesystem, it results in a _clean mount_, and the kernel filesystem code can not signal to NFS that an unclean mount occurred and so it should bump it's crash counter. IOWs, this whole "filesystems need to tell NFS about crashes" propagates all the way through *every filesystem tool chain*, not just the kernel mount code. And we most certainly don't control every 3rd party application that walks around in the filesystem on disk format, and so there are -zero- guarantees that the kernel filesystem mount code can give that an unclean shutdown occurred prior to the current mount. And then for niche NFS server applications (like transparent fail-over between HA NFS servers) there are even more rigid constraints on NFS change attributes. And you're asking local filesystems to know about these application constraints and bake them into their on-disk format again. This whole discussion has come about because we baked certain behaviour for NFS into the on-disk format many, many years ago, and it's only now that it is considered inadequate for *new* NFS application related functionality (e.g. fscache integration and cache validity across server side mount cycles). We've learnt a valuable lesson from this: don't bake application specific persistent metadata requirements into the on-disk format because when the application needs to change, it requires every filesystem that supports taht application level functionality to change their on-disk formats... > I'm not sure what good we'll get out of bolting this > scheme onto the NFS server, when the filesystem could just as easily > give us this info. 
The xattr scheme guarantees the correct application behaviour that the NFS server requires, all at the NFS application level without requiring local filesystems to support the NFS requirements in their on-disk format. The NFS server controls the format and versioning of its on-disk persistent metadata (i.e. the xattrs it uses) and so any changes to the application level requirements of that functionality are now completely under the control of the application. i.e. the application gets to manage version control, backwards and forwards compatibility of its persistent metadata, etc. What you are asking is that every local filesystem takes responsibility for managing the long term persistent metadata that only NFS requires. It's more complex to do this at the filesystem level, and we have to replicate the same work for every filesystem that is going to support this on-disk functionality. Using xattrs means the functionality is implemented once, it's common across all local filesystems, and no exportable filesystem needs to know anything about it as it's all self-contained in the NFS server code. The code is smaller, easier to maintain, consistent across all systems, easy to test, etc. It also can be implemented and rolled out *immediately* to all existing supported NFS server implementations, without having to wait months/years (or never!) for local filesystem on-disk format changes to roll out to production systems. Asking individual filesystems to implement application specific persistent metadata is a *last resort* and should only be done if correctness or performance cannot be obtained in any other way. So, yeah, the only sane direction to take here is to use xattrs to store this NFS application level information. It's less work for everyone, and in the long term it means when the NFS application requirements change again, we don't need to modify the on-disk format of multiple local filesystems. > In any case, the main problem at this point is not so much in detecting > when there has been an unclean shutdown, but rather what to do when > there is one. We need to to advance the presented change attributes > beyond the largest possible one that may have been handed out prior to > the crash. Sure, but you're missing my point: by using xattrs for detection, you don't need to involve anything to do with local filesystems at all. > How do we determine what that offset should be? Your last email > suggested that there really is no limit to the number of i_version bumps > that can happen in memory before one of them makes it to disk. What can > we do to address that? <shrug> I'm just pointing out problems I see when defining this as behaviour for on-disk format purposes. If we define it as part of the on-disk format, then we have to be concerned about how it may be used outside the scope of just the NFS server application. However, if NFS keeps this metadata and functionality entirely contained at the application level via xattrs, I really don't care what algorithm NFS developers decide to use for their crash sequencing. It's not my concern at this point, and that's precisely why NFS should be using xattrs for this NFS specific functionality. -Dave.
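To make the xattr scheme above concrete, here is a minimal userspace sketch of the export/unexport handshake being described. The xattr names, the function names, and the raw-u64 counter encoding are all made up for illustration; nothing like this exists in nfsd or nfs-utils today, and real code would need proper error handling.

#define _GNU_SOURCE
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>

#define CRASH_XATTR	"trusted.nfsd.crash"		/* hypothetical name */
#define COUNTER_XATTR	"trusted.nfsd.crash_counter"	/* hypothetical name */

/*
 * On export: if the "crash" marker is still present, the filesystem was
 * not unexported cleanly, so bump the persistent crash counter. Either
 * way, leave the marker in place while the export is in service.
 */
static uint64_t export_crash_counter(const char *root)
{
	uint64_t counter = 0;

	getxattr(root, COUNTER_XATTR, &counter, sizeof(counter));
	if (getxattr(root, CRASH_XATTR, NULL, 0) >= 0) {
		counter++;	/* unclean shutdown detected */
		setxattr(root, COUNTER_XATTR, &counter, sizeof(counter), 0);
	}
	setxattr(root, CRASH_XATTR, "", 0, 0);
	return counter;		/* folded into change attrs for this export */
}

/* On clean unexport: sync everything, then drop the marker. */
static void unexport_clean(const char *root)
{
	int fd = open(root, O_RDONLY | O_DIRECTORY);

	if (fd >= 0) {
		syncfs(fd);	/* client-visible state is now on disk */
		close(fd);
	}
	removexattr(root, CRASH_XATTR);
}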
On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote: > On Tue, Sep 20, 2022 at 06:26:05AM -0400, Jeff Layton wrote: > > On Tue, 2022-09-20 at 10:16 +1000, Dave Chinner wrote: > > > IOWs, the NFS server can define it's own on-disk persistent metadata > > > using xattrs, and you don't need local filesystems to be modified at > > > all. You can add the crash epoch into the change attr that is sent > > > to NFS clients without having to change the VFS i_version > > > implementation at all. > > > > > > This whole problem is solvable entirely within the NFS server code, > > > and we don't need to change local filesystems at all. NFS can > > > control the persistence and format of the xattrs it uses, and it > > > does not need new custom on-disk format changes from every > > > filesystem to support this new application requirement. > > > > > > At this point, NFS server developers don't need to care what the > > > underlying filesystem format provides - the xattrs provide the crash > > > detection and enumeration the NFS server functionality requires. > > > > > > > Doesn't the filesystem already detect when it's been mounted after an > > unclean shutdown? > > Not every filesystem will be able to guarantee unclean shutdown > detection at the next mount. That's the whole problem - NFS > developers are asking for something that cannot be provided as > generic functionality by individual filesystems, so the NFS server > application is going to have to work around any filesytem that > cannot provide the information it needs. > > e.g. ext4 has it journal replayed by the userspace tools prior > to mount, so when it then gets mounted by the kernel it's seen as a > clean mount. > > If we shut an XFS filesystem down due to a filesystem corruption or > failed IO to the journal code, the kernel might not be able to > replay the journal on mount (i.e. it is corrupt). We then run > xfs_repair, and that fixes the corruption issue and -cleans the > log-. When we next mount the filesystem, it results in a _clean > mount_, and the kernel filesystem code can not signal to NFS that an > unclean mount occurred and so it should bump it's crash counter. > > IOWs, this whole "filesystems need to tell NFS about crashes" > propagates all the way through *every filesystem tool chain*, not > just the kernel mount code. And we most certainly don't control > every 3rd party application that walks around in the filesystem on > disk format, and so there are -zero- guarantees that the kernel > filesystem mount code can give that an unclean shutdown occurred > prior to the current mount. > > And then for niche NFS server applications (like transparent > fail-over between HA NFS servers) there are even more rigid > constraints on NFS change attributes. And you're asking local > filesystems to know about these application constraints and bake > them into their on-disk format again. > > This whole discussion has come about because we baked certain > behaviour for NFS into the on-disk format many, many years ago, and > it's only now that it is considered inadequate for *new* NFS > application related functionality (e.g. fscache integration and > cache validity across server side mount cycles). > > We've learnt a valuable lesson from this: don't bake application > specific persistent metadata requirements into the on-disk format > because when the application needs to change, it requires every > filesystem that supports taht application level functionality > to change their on-disk formats... 
> > > I'm not sure what good we'll get out of bolting this > > scheme onto the NFS server, when the filesystem could just as easily > > give us this info. > > The xattr scheme guarantees the correct application behaviour that the NFS > server requires, all at the NFS application level without requiring > local filesystems to support the NFS requirements in their on-disk > format. THe NFS server controls the format and versioning of it's > on-disk persistent metadata (i.e. the xattrs it uses) and so any > changes to the application level requirements of that functionality > are now completely under the control of the application. > > i.e. the application gets to manage version control, backwards and > forwards compatibility of it's persistent metadata, etc. What you > are asking is that every local filesystem takes responsibility for > managing the long term persistent metadata that only NFS requires. > It's more complex to do this at the filesystem level, and we have to > replicate the same work for every filesystem that is going to > support this on-disk functionality. > > Using xattrs means the functionality is implemented once, it's > common across all local filesystems, and no exportable filesystem > needs to know anything about it as it's all self-contained in the > NFS server code. THe code is smaller, easier to maintain, consistent > across all systems, easy to test, etc. > > It also can be implemented and rolled out *immediately* to all > existing supported NFS server implementations, without having to > wait months/years (or never!) for local filesystem on-disk format > changes to roll out to production systems. > > Asking individual filesystems to implement application specific > persistent metadata is a *last resort* and should only be done if > correctness or performance cannot be obtained in any other way. > > So, yeah, the only sane direction to take here is to use xattrs to > store this NFS application level information. It's less work for > everyone, and in the long term it means when the NFS application > requirements change again, we don't need to modify the on-disk > format of multiple local filesystems. > > > In any case, the main problem at this point is not so much in detecting > > when there has been an unclean shutdown, but rather what to do when > > there is one. We need to to advance the presented change attributes > > beyond the largest possible one that may have been handed out prior to > > the crash. > > Sure, but you're missing my point: by using xattrs for detection, > you don't need to involve anything to do with local filesystems at > all. > > > How do we determine what that offset should be? Your last email > > suggested that there really is no limit to the number of i_version bumps > > that can happen in memory before one of them makes it to disk. What can > > we do to address that? > > <shrug> > > I'm just pointing out problems I see when defining this as behaviour > for on-disk format purposes. If we define it as part of the on-disk > format, then we have to be concerned about how it may be used > outside the scope of just the NFS server application. > > However, If NFS keeps this metadata and functionaly entirely > contained at the application level via xattrs, I really don't care > what algorithm NFS developers decides to use for their crash > sequencing. It's not my concern at this point, and that's precisely > why NFS should be using xattrs for this NFS specific functionality. 
> I get it: you'd rather not have to deal with what you see as an NFS problem, but I don't get how what you're proposing solves anything. We might be able to use that scheme to detect crashes, but that's only part of the problem (and it's a relatively simple part of the problem to solve, really). Maybe you can clarify it for me: Suppose we go with what you're saying and store some information in xattrs that allows us to detect crashes in some fashion. The server crashes and comes back up and we detect that there was a crash earlier. What does nfsd need to do now to ensure that it doesn't hand out a duplicate change attribute? Until we can answer that question, detecting crashes doesn't matter.
On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote: > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote: > > > How do we determine what that offset should be? Your last email > > > suggested that there really is no limit to the number of i_version bumps > > > that can happen in memory before one of them makes it to disk. What can > > > we do to address that? > > > > <shrug> > > > > I'm just pointing out problems I see when defining this as behaviour > > for on-disk format purposes. If we define it as part of the on-disk > > format, then we have to be concerned about how it may be used > > outside the scope of just the NFS server application. > > > > However, If NFS keeps this metadata and functionaly entirely > > contained at the application level via xattrs, I really don't care > > what algorithm NFS developers decides to use for their crash > > sequencing. It's not my concern at this point, and that's precisely > > why NFS should be using xattrs for this NFS specific functionality. > > > > I get it: you'd rather not have to deal with what you see as an NFS > problem, but I don't get how what you're proposing solves anything. We > might be able to use that scheme to detect crashes, but that's only part > of the problem (and it's a relatively simple part of the problem to > solve, really). > > Maybe you can clarify it for me: > > Suppose we go with what you're saying and store some information in > xattrs that allows us to detect crashes in some fashion. The server > crashes and comes back up and we detect that there was a crash earlier. > > What does nfsd need to do now to ensure that it doesn't hand out a > duplicate change attribute? As I've already stated, the NFS server can hold the persistent NFS crash counter value in a second xattr that it bumps whenever it detects a crash and hence we take the local filesystem completely out of the equation. How the crash counter is then used by the nfsd to fold it into the NFS protocol change attribute is a nfsd problem, not a local filesystem problem. If you're worried about maximum number of writes outstanding vs i_version bumps that are held in memory, then *bound the maximum number of uncommitted i_version changes that the NFS server will allow to build up in memory*. By moving the crash counter to being a NFS server only function, the NFS server controls the entire algorithm and it doesn't have to care about external 3rd party considerations like local filesystems have to. e.g. The NFS server can track the i_version values when the NFSD syncs/commits a given inode. The nfsd can sample i_version when it calls ->commit_metadata or flushes data on the inode, and then when it peeks at i_version when gathering post-op attrs (or any other getattr op) it can decide that there is too much in-memory change (e.g. 10,000 counts since last sync) and sync the inode. i.e. the NFS server can trivially cap the maximum number of uncommitted NFS change attr bumps it allows to build up in memory. At that point, the NFS server has a bound "maximum write count" that can be used in conjunction with the xattr based crash counter to determine how the change_attr is bumped by the crash counter. -Dave.
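A rough sketch of the sampling described here, as it might sit in nfsd. The nfsd_inode_info structure, its committed_version field, and the 10,000 threshold are hypothetical bookkeeping that nfsd would have to add somewhere (this is the "grow struct inode" cost discussed below); inode_query_iversion() and write_inode_now() are existing kernel helpers.

#include <linux/fs.h>
#include <linux/iversion.h>
#include <linux/writeback.h>

#define NFSD_MAX_UNCOMMITTED_BUMPS	10000	/* illustrative cap */

struct nfsd_inode_info {
	u64 committed_version;	/* i_version known to be on stable storage */
};

/* Call after ->commit_metadata()/data flush for the inode has succeeded. */
static void nfsd_record_committed_version(struct nfsd_inode_info *nii,
					  struct inode *inode)
{
	nii->committed_version = inode_query_iversion(inode);
}

/* Call while gathering post-op attributes (or any other getattr op). */
static void nfsd_cap_uncommitted_bumps(struct nfsd_inode_info *nii,
				       struct inode *inode)
{
	u64 cur = inode_query_iversion(inode);

	if (cur - nii->committed_version > NFSD_MAX_UNCOMMITTED_BUMPS) {
		/* too much uncommitted change: push the inode out now */
		write_inode_now(inode, 1);
		nii->committed_version = cur;
	}
}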
On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote: > > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote: > > > > How do we determine what that offset should be? Your last email > > > > suggested that there really is no limit to the number of i_version bumps > > > > that can happen in memory before one of them makes it to disk. What can > > > > we do to address that? > > > > > > <shrug> > > > > > > I'm just pointing out problems I see when defining this as behaviour > > > for on-disk format purposes. If we define it as part of the on-disk > > > format, then we have to be concerned about how it may be used > > > outside the scope of just the NFS server application. > > > > > > However, If NFS keeps this metadata and functionaly entirely > > > contained at the application level via xattrs, I really don't care > > > what algorithm NFS developers decides to use for their crash > > > sequencing. It's not my concern at this point, and that's precisely > > > why NFS should be using xattrs for this NFS specific functionality. > > > > > > > I get it: you'd rather not have to deal with what you see as an NFS > > problem, but I don't get how what you're proposing solves anything. We > > might be able to use that scheme to detect crashes, but that's only part > > of the problem (and it's a relatively simple part of the problem to > > solve, really). > > > > Maybe you can clarify it for me: > > > > Suppose we go with what you're saying and store some information in > > xattrs that allows us to detect crashes in some fashion. The server > > crashes and comes back up and we detect that there was a crash earlier. > > > > What does nfsd need to do now to ensure that it doesn't hand out a > > duplicate change attribute? > > As I've already stated, the NFS server can hold the persistent NFS > crash counter value in a second xattr that it bumps whenever it > detects a crash and hence we take the local filesystem completely > out of the equation. How the crash counter is then used by the nfsd > to fold it into the NFS protocol change attribute is a nfsd problem, > not a local filesystem problem. > Ok, assuming you mean put this in an xattr that lives at the root of the export? We only need this for IS_I_VERSION filesystems (btrfs, xfs, and ext4), and they all support xattrs so this scheme should work. > If you're worried about maximum number of writes outstanding vs > i_version bumps that are held in memory, then *bound the maximum > number of uncommitted i_version changes that the NFS server will > allow to build up in memory*. By moving the crash counter to being a > NFS server only function, the NFS server controls the entire > algorithm and it doesn't have to care about external 3rd party > considerations like local filesystems have to. > Yeah, this is the bigger consideration. > e.g. The NFS server can track the i_version values when the NFSD > syncs/commits a given inode. The nfsd can sample i_version it when > calls ->commit_metadata or flushed data on the inode, and then when > it peeks at i_version when gathering post-op attrs (or any other > getattr op) it can decide that there is too much in-memory change > (e.g. 10,000 counts since last sync) and sync the inode. > > i.e. the NFS server can trivially cap the maximum number of > uncommitted NFS change attr bumps it allows to build up in memory. 
> At that point, the NFS server has a bound "maximum write count" that > can be used in conjunction with the xattr based crash counter to > determine how the change_attr is bumped by the crash counter. Well, not "trivially". This is the bit where we have to grow struct inode (or the fs-specific inode), as we'll need to know what the latest on-disk value is for the inode. I'm leaning toward doing this on the query side. Basically, when nfsd goes to query the i_version, it'll check the delta between the current version and the latest one on disk. If it's bigger than X then we'd just return NFS4ERR_DELAY to the client. If the delta is >X/2, maybe it can kick off a workqueue job or something that calls write_inode with WB_SYNC_ALL to try to get the thing onto the platter ASAP.
On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote: > > > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote: > > > > > How do we determine what that offset should be? Your last email > > > > > suggested that there really is no limit to the number of i_version bumps > > > > > that can happen in memory before one of them makes it to disk. What can > > > > > we do to address that? > > > > > > > > <shrug> > > > > > > > > I'm just pointing out problems I see when defining this as behaviour > > > > for on-disk format purposes. If we define it as part of the on-disk > > > > format, then we have to be concerned about how it may be used > > > > outside the scope of just the NFS server application. > > > > > > > > However, If NFS keeps this metadata and functionaly entirely > > > > contained at the application level via xattrs, I really don't care > > > > what algorithm NFS developers decides to use for their crash > > > > sequencing. It's not my concern at this point, and that's precisely > > > > why NFS should be using xattrs for this NFS specific functionality. > > > > > > > > > > I get it: you'd rather not have to deal with what you see as an NFS > > > problem, but I don't get how what you're proposing solves anything. We > > > might be able to use that scheme to detect crashes, but that's only part > > > of the problem (and it's a relatively simple part of the problem to > > > solve, really). > > > > > > Maybe you can clarify it for me: > > > > > > Suppose we go with what you're saying and store some information in > > > xattrs that allows us to detect crashes in some fashion. The server > > > crashes and comes back up and we detect that there was a crash earlier. > > > > > > What does nfsd need to do now to ensure that it doesn't hand out a > > > duplicate change attribute? > > > > As I've already stated, the NFS server can hold the persistent NFS > > crash counter value in a second xattr that it bumps whenever it > > detects a crash and hence we take the local filesystem completely > > out of the equation. How the crash counter is then used by the nfsd > > to fold it into the NFS protocol change attribute is a nfsd problem, > > not a local filesystem problem. > > > > Ok, assuming you mean put this in an xattr that lives at the root of the > export? We only need this for IS_I_VERSION filesystems (btrfs, xfs, and > ext4), and they all support xattrs so this scheme should work. > I had a look at this today and it's not as straightforward as it sounds. In particular, there is no guarantee that an export will not cross filesystem boundaries. Also, nfsd and mountd are very much "demand driven". We might not touch an exported filesystem at all if nothing asks for it. Ensuring we can do something to every exported filesystem after a crash is more difficult than it sounds. So trying to do something with xattrs on the exported filesystems is probably not what we want. It's also sort of janky since we do strive to leave a "light footprint" on the exported filesystem. Maybe we don't need that though. Chuck reminded me that nfsdcltrack could be used here instead. We can punt this to userland! nfsdcltrack could keep track of a global crash "salt", and feed that to nfsd when it starts up. When starting a grace period, it can set a RUNNING flag in the db. If it's set when the server starts, we know there was a crash and can bump the crash counter. 
When nfsd is shutting down cleanly, it can call sync() and then clear the flag (this may require a new cld upcall cmd). We then mix that value into the change attribute for IS_I_VERSION inodes. That's probably good enough for nfsd, but if we wanted to present this to userland via statx, we'd need a different mechanism. For now, I'm going to plan to fix this up in nfsd and then we'll see where we are. > > If you're worried about maximum number of writes outstanding vs > > i_version bumps that are held in memory, then *bound the maximum > > number of uncommitted i_version changes that the NFS server will > > allow to build up in memory*. By moving the crash counter to being a > > NFS server only function, the NFS server controls the entire > > algorithm and it doesn't have to care about external 3rd party > > considerations like local filesystems have to. > > > > Yeah, this is the bigger consideration. > > > e.g. The NFS server can track the i_version values when the NFSD > > syncs/commits a given inode. The nfsd can sample i_version it when > > calls ->commit_metadata or flushed data on the inode, and then when > > it peeks at i_version when gathering post-op attrs (or any other > > getattr op) it can decide that there is too much in-memory change > > (e.g. 10,000 counts since last sync) and sync the inode. > > > > i.e. the NFS server can trivially cap the maximum number of > > uncommitted NFS change attr bumps it allows to build up in memory. > > At that point, the NFS server has a bound "maximum write count" that > > can be used in conjunction with the xattr based crash counter to > > determine how the change_attr is bumped by the crash counter. > > Well, not "trivially". This is the bit where we have to grow struct > inode (or the fs-specific inode), as we'll need to know what the latest > on-disk value is for the inode. > > I'm leaning toward doing this on the query side. Basically, when nfsd > goes to query the i_version, it'll check the delta between the current > version and the latest one on disk. If it's bigger than X then we'd just > return NFS4ERR_DELAY to the client. > > If the delta is >X/2, maybe it can kick off a workqueue job or something > that calls write_inode with WB_SYNC_ALL to try to get the thing onto the > platter ASAP. Still looking at this bit too. Probably we can just kick off a WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best?
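For the opportunistic kick at the end, a tiny sketch of what that might look like; note that filemap_fdatawrite() itself writes with WB_SYNC_ALL, so the WB_SYNC_NONE variant would presumably be filemap_flush(). The helper name below is hypothetical.

#include <linux/fs.h>

/* Kick off non-blocking writeback for an inode whose in-memory i_version
 * has run well ahead of what is known to be on disk. */
static void nfsd_kick_writeback(struct inode *inode)
{
	filemap_flush(inode->i_mapping);	/* WB_SYNC_NONE, does not wait */
}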
On Thu 22-09-22 16:18:02, Jeff Layton wrote: > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > e.g. The NFS server can track the i_version values when the NFSD > > > syncs/commits a given inode. The nfsd can sample i_version it when > > > calls ->commit_metadata or flushed data on the inode, and then when > > > it peeks at i_version when gathering post-op attrs (or any other > > > getattr op) it can decide that there is too much in-memory change > > > (e.g. 10,000 counts since last sync) and sync the inode. > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > uncommitted NFS change attr bumps it allows to build up in memory. > > > At that point, the NFS server has a bound "maximum write count" that > > > can be used in conjunction with the xattr based crash counter to > > > determine how the change_attr is bumped by the crash counter. > > > > Well, not "trivially". This is the bit where we have to grow struct > > inode (or the fs-specific inode), as we'll need to know what the latest > > on-disk value is for the inode. > > > > I'm leaning toward doing this on the query side. Basically, when nfsd > > goes to query the i_version, it'll check the delta between the current > > version and the latest one on disk. If it's bigger than X then we'd just > > return NFS4ERR_DELAY to the client. > > > > If the delta is >X/2, maybe it can kick off a workqueue job or something > > that calls write_inode with WB_SYNC_ALL to try to get the thing onto the > > platter ASAP. > > Still looking at this bit too. Probably we can just kick off a > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best? "Hope" is not a great assurance regarding data integrity ;) Anyway, it depends on how you imagine the "i_version on disk" is going to be maintained. It could be maintained by NFSD inside commit_inode_metadata() - fetch the current i_version value before asking the filesystem for the sync, and by the time commit_metadata() returns we know that value is on disk. If we detect the current - on_disk is > X/2, we call commit_inode_metadata() and we are done. It is not even *that* expensive because usually filesystems optimize away unnecessary IO when the inode didn't change since the last time it got synced. Honza
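Concretely, the ordering being suggested here might be wrapped around nfsd's commit path roughly as follows. commit_inode_metadata() is a stand-in name for whatever sync helper nfsd ends up using, and the on_disk bookkeeping is hypothetical; the point is only that the sample is taken before the sync, so it is guaranteed to be on stable storage once the sync returns.

#include <linux/fs.h>
#include <linux/iversion.h>

int commit_inode_metadata(struct inode *inode);	/* stand-in for nfsd's sync path */

static int nfsd_commit_and_record(struct inode *inode, u64 *on_disk)
{
	u64 sample = inode_query_iversion(inode);	/* sample first... */
	int err = commit_inode_metadata(inode);		/* ...then sync */

	if (!err)
		*on_disk = sample;	/* the sync covered this value */
	return err;
}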
On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote: > On Thu 22-09-22 16:18:02, Jeff Layton wrote: > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > > e.g. The NFS server can track the i_version values when the NFSD > > > > syncs/commits a given inode. The nfsd can sample i_version it when > > > > calls ->commit_metadata or flushed data on the inode, and then when > > > > it peeks at i_version when gathering post-op attrs (or any other > > > > getattr op) it can decide that there is too much in-memory change > > > > (e.g. 10,000 counts since last sync) and sync the inode. > > > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > > uncommitted NFS change attr bumps it allows to build up in memory. > > > > At that point, the NFS server has a bound "maximum write count" that > > > > can be used in conjunction with the xattr based crash counter to > > > > determine how the change_attr is bumped by the crash counter. > > > > > > Well, not "trivially". This is the bit where we have to grow struct > > > inode (or the fs-specific inode), as we'll need to know what the latest > > > on-disk value is for the inode. > > > > > > I'm leaning toward doing this on the query side. Basically, when nfsd > > > goes to query the i_version, it'll check the delta between the current > > > version and the latest one on disk. If it's bigger than X then we'd just > > > return NFS4ERR_DELAY to the client. > > > > > > If the delta is >X/2, maybe it can kick off a workqueue job or something > > > that calls write_inode with WB_SYNC_ALL to try to get the thing onto the > > > platter ASAP. > > > > Still looking at this bit too. Probably we can just kick off a > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best? > > "Hope" is not a great assurance regarding data integrity ;) By "hoping for the best", I meant hoping that we never have to take the drastic action of returning NFS4ERR_DELAY on GETATTR operations. We definitely don't want to jeopardize data integrity. > Anyway, it > depends on how you imagine the "i_version on disk" is going to be > maintained. It could be maintained by NFSD inside commit_inode_metadata() - > fetch current i_version value before asking filesystem for the sync and by the > time commit_metadata() returns we know that value is on disk. If we detect the > current - on_disk is > X/2, we call commit_inode_metadata() and we are > done. It is not even *that* expensive because usually filesystems optimize > away unnecessary IO when the inode didn't change since last time it got > synced. > At >X/2 we don't really want to start blocking or anything. I'd prefer if we could kick off something in the background for this, but if it's not too expensive then maybe just calling commit_inode_metadata synchronously in this codepath is OK. Alternately, we could consider doing that in a workqueue job too. I need to do a bit more research here, but I think we have some options.
On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote: > On Thu 22-09-22 16:18:02, Jeff Layton wrote: > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > > e.g. The NFS server can track the i_version values when the > > > > NFSD > > > > syncs/commits a given inode. The nfsd can sample i_version it > > > > when > > > > calls ->commit_metadata or flushed data on the inode, and then > > > > when > > > > it peeks at i_version when gathering post-op attrs (or any > > > > other > > > > getattr op) it can decide that there is too much in-memory > > > > change > > > > (e.g. 10,000 counts since last sync) and sync the inode. > > > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > > uncommitted NFS change attr bumps it allows to build up in > > > > memory. > > > > At that point, the NFS server has a bound "maximum write count" > > > > that > > > > can be used in conjunction with the xattr based crash counter > > > > to > > > > determine how the change_attr is bumped by the crash counter. > > > > > > Well, not "trivially". This is the bit where we have to grow > > > struct > > > inode (or the fs-specific inode), as we'll need to know what the > > > latest > > > on-disk value is for the inode. > > > > > > I'm leaning toward doing this on the query side. Basically, when > > > nfsd > > > goes to query the i_version, it'll check the delta between the > > > current > > > version and the latest one on disk. If it's bigger than X then > > > we'd just > > > return NFS4ERR_DELAY to the client. > > > > > > If the delta is >X/2, maybe it can kick off a workqueue job or > > > something > > > that calls write_inode with WB_SYNC_ALL to try to get the thing > > > onto the > > > platter ASAP. > > > > Still looking at this bit too. Probably we can just kick off a > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the > > best? > > "Hope" is not a great assurance regarding data integrity ;) Anyway, > it > depends on how you imagine the "i_version on disk" is going to be > maintained. It could be maintained by NFSD inside > commit_inode_metadata() - > fetch current i_version value before asking filesystem for the sync > and by the > time commit_metadata() returns we know that value is on disk. If we > detect the > current - on_disk is > X/2, we call commit_inode_metadata() and we > are > done. It is not even *that* expensive because usually filesystems > optimize > away unnecessary IO when the inode didn't change since last time it > got > synced. > Note that these approaches requiring 3rd party help in order to track i_version integrity across filesystem crashes all make the idea of adding i_version to statx() a no-go. It is one thing for knfsd to add specialised machinery for integrity checking, but if all applications need to do so, then they are highly unlikely to want to adopt this attribute.
On Fri, 2022-09-23 at 13:44 +0000, Trond Myklebust wrote: > On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote: > > On Thu 22-09-22 16:18:02, Jeff Layton wrote: > > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > > > e.g. The NFS server can track the i_version values when the > > > > > NFSD > > > > > syncs/commits a given inode. The nfsd can sample i_version it > > > > > when > > > > > calls ->commit_metadata or flushed data on the inode, and then > > > > > when > > > > > it peeks at i_version when gathering post-op attrs (or any > > > > > other > > > > > getattr op) it can decide that there is too much in-memory > > > > > change > > > > > (e.g. 10,000 counts since last sync) and sync the inode. > > > > > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > > > uncommitted NFS change attr bumps it allows to build up in > > > > > memory. > > > > > At that point, the NFS server has a bound "maximum write count" > > > > > that > > > > > can be used in conjunction with the xattr based crash counter > > > > > to > > > > > determine how the change_attr is bumped by the crash counter. > > > > > > > > Well, not "trivially". This is the bit where we have to grow > > > > struct > > > > inode (or the fs-specific inode), as we'll need to know what the > > > > latest > > > > on-disk value is for the inode. > > > > > > > > I'm leaning toward doing this on the query side. Basically, when > > > > nfsd > > > > goes to query the i_version, it'll check the delta between the > > > > current > > > > version and the latest one on disk. If it's bigger than X then > > > > we'd just > > > > return NFS4ERR_DELAY to the client. > > > > > > > > If the delta is >X/2, maybe it can kick off a workqueue job or > > > > something > > > > that calls write_inode with WB_SYNC_ALL to try to get the thing > > > > onto the > > > > platter ASAP. > > > > > > Still looking at this bit too. Probably we can just kick off a > > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the > > > best? > > > > "Hope" is not a great assurance regarding data integrity ;) Anyway, > > it > > depends on how you imagine the "i_version on disk" is going to be > > maintained. It could be maintained by NFSD inside > > commit_inode_metadata() - > > fetch current i_version value before asking filesystem for the sync > > and by the > > time commit_metadata() returns we know that value is on disk. If we > > detect the > > current - on_disk is > X/2, we call commit_inode_metadata() and we > > are > > done. It is not even *that* expensive because usually filesystems > > optimize > > away unnecessary IO when the inode didn't change since last time it > > got > > synced. > > > > Note that these approaches requiring 3rd party help in order to track > i_version integrity across filesystem crashes all make the idea of > adding i_version to statx() a no-go. > > It is one thing for knfsd to add specialised machinery for integrity > checking, but if all applications need to do so, then they are highly > unlikely to want to adopt this attribute. > > Absolutely. That is the downside of this approach, but the priority here has always been to improve nfsd. If we don't get the ability to present this info via statx, then so be it. Later on, I suppose we can move that handling into the kernel in some fashion if we decide it's worthwhile. That said, not having this in statx makes it more difficult to test i_version behavior. 
Maybe we can add a generic ioctl for that in the interim?
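If such an interim testing interface were added, it would not need to be much more than the following. Everything here is hypothetical: the ioctl name, number, and semantics are invented purely to illustrate the idea, and no such ioctl exists today.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

#define FS_IOC_GET_I_VERSION	_IOR('f', 99, __u64)	/* made-up number */

int main(int argc, char **argv)
{
	__u64 vers;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_GET_I_VERSION, &vers) < 0) {
		perror(argv[1]);	/* fails today: nothing implements it */
		return 1;
	}
	printf("%s: i_version %llu\n", argv[1], (unsigned long long)vers);
	close(fd);
	return 0;
}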
> -----Original Message----- > From: Jeff Layton [mailto:jlayton@kernel.org] > Sent: Friday, September 23, 2022 6:50 AM > To: Trond Myklebust <trondmy@hammerspace.com>; jack@suse.cz > Cc: zohar@linux.ibm.com; djwong@kernel.org; brauner@kernel.org; linux- > xfs@vger.kernel.org; bfields@fieldses.org; linux-api@vger.kernel.org; > neilb@suse.de; david@fromorbit.com; fweimer@redhat.com; linux- > kernel@vger.kernel.org; chuck.lever@oracle.com; linux-man@vger.kernel.org; > linux-nfs@vger.kernel.org; linux-ext4@vger.kernel.org; tytso@mit.edu; > viro@zeniv.linux.org.uk; xiubli@redhat.com; linux-fsdevel@vger.kernel.org; > adilger.kernel@dilger.ca; lczerner@redhat.com; ceph-devel@vger.kernel.org; > linux-btrfs@vger.kernel.org > Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new > STATX_INO_VERSION field > > On Fri, 2022-09-23 at 13:44 +0000, Trond Myklebust wrote: > > On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote: > > > On Thu 22-09-22 16:18:02, Jeff Layton wrote: > > > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > > > > e.g. The NFS server can track the i_version values when the > > > > > > NFSD syncs/commits a given inode. The nfsd can sample > > > > > > i_version it when calls ->commit_metadata or flushed data on > > > > > > the inode, and then when it peeks at i_version when gathering > > > > > > post-op attrs (or any other getattr op) it can decide that > > > > > > there is too much in-memory change (e.g. 10,000 counts since > > > > > > last sync) and sync the inode. > > > > > > > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > > > > uncommitted NFS change attr bumps it allows to build up in > > > > > > memory. > > > > > > At that point, the NFS server has a bound "maximum write count" > > > > > > that > > > > > > can be used in conjunction with the xattr based crash counter > > > > > > to determine how the change_attr is bumped by the crash > > > > > > counter. > > > > > > > > > > Well, not "trivially". This is the bit where we have to grow > > > > > struct inode (or the fs-specific inode), as we'll need to know > > > > > what the latest on-disk value is for the inode. > > > > > > > > > > I'm leaning toward doing this on the query side. Basically, when > > > > > nfsd goes to query the i_version, it'll check the delta between > > > > > the current version and the latest one on disk. If it's bigger > > > > > than X then we'd just return NFS4ERR_DELAY to the client. > > > > > > > > > > If the delta is >X/2, maybe it can kick off a workqueue job or > > > > > something that calls write_inode with WB_SYNC_ALL to try to get > > > > > the thing onto the platter ASAP. > > > > > > > > Still looking at this bit too. Probably we can just kick off a > > > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the > > > > best? > > > > > > "Hope" is not a great assurance regarding data integrity ;) Anyway, > > > it depends on how you imagine the "i_version on disk" is going to be > > > maintained. It could be maintained by NFSD inside > > > commit_inode_metadata() - > > > fetch current i_version value before asking filesystem for the sync > > > and by the time commit_metadata() returns we know that value is on > > > disk. If we detect the current - on_disk is > X/2, we call > > > commit_inode_metadata() and we are done. 
It is not even *that* > > > expensive because usually filesystems optimize away unnecessary IO > > > when the inode didn't change since last time it got synced. > > > > > > > Note that these approaches requiring 3rd party help in order to track > > i_version integrity across filesystem crashes all make the idea of > > adding i_version to statx() a no-go. > > > > It is one thing for knfsd to add specialised machinery for integrity > > checking, but if all applications need to do so, then they are highly > > unlikely to want to adopt this attribute. > > > > > > Absolutely. That is the downside of this approach, but the priority here has > always been to improve nfsd. If we don't get the ability to present this info via > statx, then so be it. Later on, I suppose we can move that handling into the > kernel in some fashion if we decide it's worthwhile. > > That said, not having this in statx makes it more difficult to test i_version > behavior. Maybe we can add a generic ioctl for that in the interim? Having i_version in statx would be really nice for nfs-ganesha. I would consider doing the extra integrity stuff or we may in some cases be able to rely on the filesystem, but in any case, i_version would be an improvement over using ctime or mtime for a change attribute. Frank
On Fri, 23 Sep 2022, Jeff Layton wrote: > > Absolutely. That is the downside of this approach, but the priority here > has always been to improve nfsd. If we don't get the ability to present > this info via statx, then so be it. Later on, I suppose we can move that > handling into the kernel in some fashion if we decide it's worthwhile. > > That said, not having this in statx makes it more difficult to test > i_version behavior. Maybe we can add a generic ioctl for that in the > interim? I wonder if we are over-thinking this, trying too hard, making "perfect" the enemy of "good". While we agree that the current implementation of i_version is imperfect, it isn't causing major data corruption all around the world. I don't think there are even any known bug reports, are there? So while we do want to fix it as best we can, we don't need to make that the first priority. I think the first priority should be to document how we want it to work, which is what this thread is really all about. The documentation can note that some (all) filesystems do not provide perfect semantics across unclean restarts, and can list any other anomalies that we are aware of. And on that basis we can export the current i_version to user-space via statx and start trying to write some test code. We can then look at moving the i_version/ctime update from *before* the write to *after* the write, and any other improvements that can be achieved easily in common code. We can then update the man page to say "since Linux 6.42, this list of anomalies is no longer present". Then we can explore some options for handling unclean restart - in a context where we can write tests and maybe even demonstrate a concrete problem before we start trying to fix it. NeilBrown
On Tue, 2022-09-27 at 08:43 +1000, NeilBrown wrote: > On Fri, 23 Sep 2022, Jeff Layton wrote: > > > > Absolutely. That is the downside of this approach, but the priority here > > has always been to improve nfsd. If we don't get the ability to present > > this info via statx, then so be it. Later on, I suppose we can move that > > handling into the kernel in some fashion if we decide it's worthwhile. > > > > That said, not having this in statx makes it more difficult to test > > i_version behavior. Maybe we can add a generic ioctl for that in the > > interim? > > I wonder if we are over-thinking this, trying too hard, making "perfect" > the enemy of "good". I tend to think we are. > While we agree that the current implementation of i_version is > imperfect, it isn't causing major data corruption all around the world. > I don't think there are even any known bug reports are there? > So while we do want to fix it as best we can, we don't need to make that > the first priority. > I'm not aware of any bug reports aside from the issue of atime updates affecting the change attribute, but the effects of misbehavior here can be very subtle. > I think the first priority should be to document how we want it to work, > which is what this thread is really all about. The documentation can > note that some (all) filesystems do not provide perfect semantics across > unclean restarts, and can list any other anomalies that we are aware of. > And on that basis we can export the current i_version to user-space via > statx and start trying to write some test code. > > We can then look at moving the i_version/ctime update from *before* the > write to *after* the write, and any other improvements that can be > achieved easily in common code. We can then update the man page to say > "since Linux 6.42, this list of anomalies is no longer present". > I have a patch for this for ext4, and started looking at the same for btrfs and xfs. > Then we can explore some options for handling unclean restart - in a > context where we can write tests and maybe even demonstrate a concrete > problem before we start trying to fix it. > I think too that we need to recognize that there are multiple distinct issues around i_version handling: 1/ atime updates affecting i_version in ext4 and xfs, which harms performance 2/ ext4 should enable the change attribute by default 3/ we currently mix the ctime into the change attr for directories, which is unnecessary. 4/ we'd like to be able to report NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR from nfsd, but the change attr on regular files can appear to go backward after a crash+clock jump. 5/ testing i_version behavior is very difficult since there is no way to query it from userland. We can work on the first three without having to solve the last two right away.
On Tue, 2022-09-27 at 08:43 +1000, NeilBrown wrote: > On Fri, 23 Sep 2022, Jeff Layton wrote: > > > > Absolutely. That is the downside of this approach, but the priority here > > has always been to improve nfsd. If we don't get the ability to present > > this info via statx, then so be it. Later on, I suppose we can move that > > handling into the kernel in some fashion if we decide it's worthwhile. > > > > That said, not having this in statx makes it more difficult to test > > i_version behavior. Maybe we can add a generic ioctl for that in the > > interim? > > I wonder if we are over-thinking this, trying too hard, making "perfect" > the enemy of "good". > While we agree that the current implementation of i_version is > imperfect, it isn't causing major data corruption all around the world. > I don't think there are even any known bug reports are there? > So while we do want to fix it as best we can, we don't need to make that > the first priority. > > I think the first priority should be to document how we want it to work, > which is what this thread is really all about. The documentation can > note that some (all) filesystems do not provide perfect semantics across > unclean restarts, and can list any other anomalies that we are aware of. > And on that basis we can export the current i_version to user-space via > statx and start trying to write some test code. > > We can then look at moving the i_version/ctime update from *before* the > write to *after* the write, and any other improvements that can be > achieved easily in common code. We can then update the man page to say > "since Linux 6.42, this list of anomalies is no longer present". > > Then we can explore some options for handling unclean restart - in a > context where we can write tests and maybe even demonstrate a concrete > problem before we start trying to fix it. > We can also argue that crash resilience isn't a hard requirement for all possible applications. We'll definitely need some sort of mitigation for nfsd so we can claim that it's MONOTONIC [1], but local applications may not care whether the value rolls backward after a crash, since they would have presumably crashed as well and may not be persisting values. IOW, I think I agree with Dave C. that crash resilience for regular files is best handled at the application level (with the first application being knfsd). RFC 7862 requires that the change_attr_type be homogeneous across the entire filesystem, so we don't have the option of deciding that on a per-inode basis. If we want to advertise it, we have to ensure that all inode types conform. I think for nfsd, a crash counter tracked in userland by nfsdcld multiplied by some large number of reasonable version bumps in a jiffy would work well and allow us to go back to advertising the value as MONOTONIC. That's a bit of a project though and may take a while. For presentation via statx, maybe we can create a STATX_ATTR_VERSION_MONOTONIC bit for stx_attributes for when the filesystem can provide that sort of guarantee. I may just add that internally for now anyway, since that would make for nicer layering. [1]: https://datatracker.ietf.org/doc/html/rfc7862#section-12.2.3
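The rebasing sketched here might reduce to something like the following on the nfsd side. The constant and the helper name are illustrative only: the bound on "lost" bumps is an assumption nfsd would have to enforce (for example with the capping discussed earlier in the thread), and none of this is settled.

#include <linux/fs.h>
#include <linux/iversion.h>

/*
 * Assume no inode can accumulate more than this many i_version bumps
 * that never reach stable storage before a crash.
 */
#define NFSD_MAX_LOST_BUMPS	(1ULL << 32)	/* illustrative bound */

static u64 nfsd_change_attr(struct inode *inode, u64 crash_counter)
{
	/* each unclean restart pushes every presented value past anything
	 * that could have been handed out before the crash */
	return inode_query_iversion(inode) +
	       crash_counter * NFSD_MAX_LOST_BUMPS;
}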
diff --git a/man2/statx.2 b/man2/statx.2 index 0d1b4591f74c..d98d5148a442 100644 --- a/man2/statx.2 +++ b/man2/statx.2 @@ -62,6 +62,7 @@ struct statx { __u32 stx_dev_major; /* Major ID */ __u32 stx_dev_minor; /* Minor ID */ __u64 stx_mnt_id; /* Mount ID */ + __u64 stx_ino_version; /* Inode change attribute */ }; .EE .in @@ -247,6 +248,7 @@ STATX_BTIME Want stx_btime STATX_ALL The same as STATX_BASIC_STATS | STATX_BTIME. It is deprecated and should not be used. STATX_MNT_ID Want stx_mnt_id (since Linux 5.8) +STATX_INO_VERSION Want stx_ino_version (DRAFT) .TE .in .PP @@ -407,10 +409,16 @@ This is the same number reported by .BR name_to_handle_at (2) and corresponds to the number in the first field in one of the records in .IR /proc/self/mountinfo . +.TP +.I stx_ino_version +The inode version, also known as the inode change attribute. See +.BR inode (7) +for details. .PP For further information on the above fields, see .BR inode (7). .\" +.TP .SS File attributes The .I stx_attributes diff --git a/man7/inode.7 b/man7/inode.7 index 9b255a890720..8e83836594d8 100644 --- a/man7/inode.7 +++ b/man7/inode.7 @@ -184,6 +184,12 @@ Last status change timestamp (ctime) This is the file's last status change timestamp. It is changed by writing or by setting inode information (i.e., owner, group, link count, mode, etc.). +.TP +Inode version (i_version) +(not returned in the \fIstat\fP structure); \fIstatx.stx_ino_version\fP +.IP +This is the inode change counter. See the discussion of +\fBthe inode version counter\fP, below. .PP The timestamp fields report time measured with a zero point at the .IR Epoch , @@ -424,6 +430,39 @@ on a directory means that a file in that directory can be renamed or deleted only by the owner of the file, by the owner of the directory, and by a privileged process. +.SS The inode version counter +.PP +The +.I statx.stx_ino_version +field is the inode change counter. Any operation that would result in a +change to \fIstatx.stx_ctime\fP must result in an increase to this value. +The value must increase even in the case where the ctime change is not +evident due to coarse timestamp granularity. +.PP +An observer cannot infer anything from amount of increase about the +nature or magnitude of the change. If the returned value is different +from the last time it was checked, then something has made an explicit +data and/or metadata change to the inode. +.PP +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the +other changes in the inode. On a write, for instance, the i_version is usually +incremented before the data is copied into the pagecache. Therefore it is +possible to see a new i_version value while a read still shows the old data. +.PP +In the event of a system crash, this value can appear to go backward, +if it were queried before ever being written to the backing store. If +the value were then incremented again after restart, then an observer +could miss noticing a change. +.PP +In order to guard against this, it is recommended to also watch the +\fIstatx.stx_ctime\fP for changes when watching this value. As long as the +system clock doesn't jump backward during the crash, an observer can be +reasonably sure that the i_version and ctime together represent a unique inode +state. +.PP +The i_version is a Linux extension and is not supported by all filesystems.
The application must verify that the \fISTATX_INO_VERSION\fP bit is set in the +returned \fIstatx.stx_mask\fP before relying on this field.
.SH STANDARDS If you need to obtain the definition of the .I blkcnt_t
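For completeness, this is roughly how an application would consume the proposed field together with the ctime, per the draft inode(7) text above. STATX_INO_VERSION and stx_ino_version are the draft names from this patch and are not in released kernels or libc headers, so the sketch only builds against headers carrying the proposal.

#define _GNU_SOURCE
#include <stdint.h>
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Returns 1 if the inode changed since (prev_version, prev_ctime) was
 * sampled, 0 if it did not, and -1 if the field is unsupported.
 */
int inode_changed(const char *path, uint64_t prev_version,
		  struct statx_timestamp prev_ctime)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_CTIME | STATX_INO_VERSION, &stx) < 0)
		return -1;
	if (!(stx.stx_mask & STATX_INO_VERSION))
		return -1;	/* filesystem does not support it */

	/* Compare the ctime as well, to guard against the counter
	 * appearing to go backward after a crash. */
	if (stx.stx_ino_version != prev_version ||
	    stx.stx_ctime.tv_sec != prev_ctime.tv_sec ||
	    stx.stx_ctime.tv_nsec != prev_ctime.tv_nsec)
		return 1;
	return 0;
}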