Message ID | 20240711-mgtime-v5-5-37bb5b465feb@kernel.org (mailing list archive) |
---|---|
State | New |
Series | fs: multigrain timestamp redux |
On Thu, Jul 11, 2024 at 07:08:09AM -0400, Jeff Layton wrote:
> Add a high-level document that describes how multigrain timestamps work,
> rationale for them, and some info about implementation and tradeoffs.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  Documentation/filesystems/multigrain-ts.rst | 120 ++++++++++++++++++++++++++++
>  1 file changed, 120 insertions(+)
>
> diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
> new file mode 100644
> index 000000000000..5cefc204ecec
> --- /dev/null
> +++ b/Documentation/filesystems/multigrain-ts.rst
> @@ -0,0 +1,120 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigrain Timestamps
> +=====================
> +
> +Introduction
> +============
> +Historically, the kernel has always used coarse time values to stamp
> +inodes. This value is updated on every jiffy, so any change that happens
> +within that jiffy will end up with the same timestamp.
> +
> +When the kernel goes to stamp an inode (due to a read or write), it first gets
> +the current time and then compares it to the existing timestamp(s) to see
> +whether anything will change. If nothing changed, then it can avoid updating
> +the inode's metadata.
> +
> +Coarse timestamps are therefore good from a performance standpoint, since they
> +reduce the need for metadata updates, but bad from the standpoint of
> +determining whether anything has changed, since a lot of things can happen in a
> +jiffy.
> +
> +They are particularly troublesome with NFSv3, where unchanging timestamps can
> +make it difficult to tell whether to invalidate caches. NFSv4 provides a
> +dedicated change attribute that should always show a visible change, but not
> +all filesystems implement this properly, causing the NFS server to substitute
> +the ctime in many cases.
> +
> +Multigrain timestamps aim to remedy this by selectively using fine-grained
> +timestamps when a file has had its timestamps queried recently, and the current
> +coarse-grained time does not cause a change.
> +
> +Inode Timestamps
> +================
> +There are currently 3 timestamps in the inode that are updated to the current
> +wallclock time on different activity:
> +
> +ctime:
> +  The inode change time. This is stamped with the current time whenever
> +  the inode's metadata is changed. Note that this value is not settable
> +  from userland.
> +
> +mtime:
> +  The inode modification time. This is stamped with the current time
> +  any time a file's contents change.
> +
> +atime:
> +  The inode access time. This is stamped whenever an inode's contents are
> +  read. Widely considered to be a terrible mistake. Usually avoided with
> +  options like noatime or relatime.

And for btime/crtime (aka creation time) a filesystem can take the
coarse timestamp, right? It's not settable by userspace, and I think
statx is the only way those are ever exposed. QUERIED is never set when
the file is being created.

> +Updating the mtime always implies a change to the ctime, but updating the
> +atime due to a read request does not.
> +
> +Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
> +not affected and always use the coarse-grained value (subject to the floor).

Is it ok if an atime update uses the same timespec as was used for a
ctime update? There's a pending update for 6.11 that changes
xfs_trans_ichgtime to do:

	tv = current_time(inode);

	if (flags & XFS_ICHGTIME_MOD)
		inode_set_mtime_to_ts(inode, tv);
	if (flags & XFS_ICHGTIME_CHG)
		inode_set_ctime_to_ts(inode, tv);
	if (flags & XFS_ICHGTIME_ACCESS)
		inode_set_atime_to_ts(inode, tv);
	if (flags & XFS_ICHGTIME_CREATE)
		ip->i_crtime = tv;

So I guess xfs could do something like this to set @tv:

	if (flags & XFS_ICHGTIME_CHG)
		tv = inode_set_ctime_current(inode);
	else
		tv = current_time();
	...
	if (flags & XFS_ICHGTIME_ACCESS)
		inode_set_atime_to_ts(inode, tv);

Thoughts?

> +Inode Timestamp Ordering
> +========================
> +
> +In addition to just providing info about changes to individual files, file
> +timestamps also serve an important purpose in applications like "make". These
> +programs measure timestamps in order to determine whether source files might be
> +newer than cached objects.
> +
> +Userland applications like make can only determine ordering based on
> +operational boundaries. For a syscall those are the syscall entry and exit
> +points. For io_uring or nfsd operations, that's the request submission and
> +response. In the case of concurrent operations, userland can make no
> +determination about the order in which things will occur.
> +
> +For instance, if a single thread modifies one file, and then another file in
> +sequence, the second file must show an equal or later mtime than the first. The
> +same is true if two threads are issuing similar operations that do not overlap
> +in time.
> +
> +If however, two threads have racing syscalls that overlap in time, then there
> +is no such guarantee, and the second file may appear to have been modified
> +before, after or at the same time as the first, regardless of which one was
> +submitted first.
> +
> +Multigrain Timestamps
> +=====================
> +Multigrain timestamps are aimed at ensuring that changes to a single file are
> +always recognizable, without violating the ordering guarantees when multiple
> +different files are modified. This affects the mtime and the ctime, but the
> +atime will always use coarse-grained timestamps.
> +
> +It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
> +or ctime has been queried. If either or both have, then the kernel takes
> +special care to ensure the next timestamp update will display a visible change.
> +This ensures tight cache coherency for use-cases like NFS, without sacrificing
> +the benefits of reduced metadata updates when files aren't being watched.
> +
> +The Ctime Floor Value
> +=====================
> +It's not sufficient to simply use fine or coarse-grained timestamps based on
> +whether the mtime or ctime has been queried. A file could get a fine grained
> +timestamp, and then a second file modified later could get a coarse-grained one
> +that appears earlier than the first, which would break the kernel's timestamp
> +ordering guarantees.
> +
> +To mitigate this problem, we maintain a global floor value that ensures that
> +this can't happen. The two files in the above example may appear to have been
> +modified at the same time in such a case, but they will never show the reverse
> +order. To avoid problems with realtime clock jumps, the floor is managed as a
> +monotonic ktime_t, and the values are converted to realtime clock values as
> +needed.

monotonic atomic64_t?

--D

> +
> +Implementation Notes
> +====================
> +Multigrain timestamps are intended for use by local filesystems that get
> +ctime values from the local clock. This is in contrast to network filesystems
> +and the like that just mirror timestamp values from a server.
> +
> +For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
> +fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
> +inode_set_ctime_current(). If the filesystem has a ->getattr routine that
> +doesn't call generic_fillattr, then you should have it call fill_mg_cmtime to
> +fill those values.
>
> --
> 2.45.2
>
>
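
The pattern proposed in the message above, where the atime takes whatever value the ctime update produced, can be modelled in plain userspace C. This is only a sketch under stated assumptions: `model_inode`, the `MODEL_ICHGTIME_*` flags, the two fake clock variables and every `model_*` function are hypothetical stand-ins invented for illustration, not the XFS or VFS code, and the stand-in for inode_set_ctime_current() always picks the fine-grained value as if the inode had been queried recently.

```c
#include <stdint.h>

/* Hypothetical flag names, loosely modelled on the XFS_ICHGTIME_* flags. */
#define MODEL_ICHGTIME_MOD	0x1u
#define MODEL_ICHGTIME_CHG	0x2u
#define MODEL_ICHGTIME_ACCESS	0x4u

/* Hypothetical inode model; timestamps are plain nanosecond counts. */
struct model_inode {
	int64_t atime;
	int64_t mtime;
	int64_t ctime;
};

/* Fake clocks standing in for the kernel's coarse and fine time sources. */
static int64_t model_coarse_now = 1000;
static int64_t model_fine_now = 1007;

/* Stand-in for current_time(): returns the coarse-grained time. */
static int64_t model_current_time(void)
{
	return model_coarse_now;
}

/*
 * Stand-in for inode_set_ctime_current(): stamps the ctime and returns the
 * value it used. Assumption: the inode was queried, so the fine value wins.
 */
static int64_t model_set_ctime_current(struct model_inode *ino)
{
	ino->ctime = model_fine_now;
	return ino->ctime;
}

/* Choose one timestamp per call and let mtime/atime "flow" from it. */
static void model_ichgtime(struct model_inode *ino, unsigned int flags)
{
	int64_t tv;

	if (flags & MODEL_ICHGTIME_CHG)
		tv = model_set_ctime_current(ino);
	else
		tv = model_current_time();

	if (flags & MODEL_ICHGTIME_MOD)
		ino->mtime = tv;
	if (flags & MODEL_ICHGTIME_ACCESS)
		ino->atime = tv;
}
```

With this shape, a call that updates ctime, mtime and atime together leaves all three equal, while an atime-only update falls back to the coarse clock and leaves the ctime untouched.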
On Thu, 2024-07-11 at 12:12 -0700, Darrick J. Wong wrote:
> On Thu, Jul 11, 2024 at 07:08:09AM -0400, Jeff Layton wrote:
> > Add a high-level document that describes how multigrain timestamps work,
> > rationale for them, and some info about implementation and tradeoffs.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  Documentation/filesystems/multigrain-ts.rst | 120 ++++++++++++++++++++++++++++
> >  1 file changed, 120 insertions(+)
> >
> > diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
> > new file mode 100644
> > index 000000000000..5cefc204ecec
> > --- /dev/null
> > +++ b/Documentation/filesystems/multigrain-ts.rst
> > @@ -0,0 +1,120 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================
> > +Multigrain Timestamps
> > +=====================
> > +
> > +Introduction
> > +============
> > +Historically, the kernel has always used coarse time values to stamp
> > +inodes. This value is updated on every jiffy, so any change that happens
> > +within that jiffy will end up with the same timestamp.
> > +
> > +When the kernel goes to stamp an inode (due to a read or write), it first gets
> > +the current time and then compares it to the existing timestamp(s) to see
> > +whether anything will change. If nothing changed, then it can avoid updating
> > +the inode's metadata.
> > +
> > +Coarse timestamps are therefore good from a performance standpoint, since they
> > +reduce the need for metadata updates, but bad from the standpoint of
> > +determining whether anything has changed, since a lot of things can happen in a
> > +jiffy.
> > +
> > +They are particularly troublesome with NFSv3, where unchanging timestamps can
> > +make it difficult to tell whether to invalidate caches. NFSv4 provides a
> > +dedicated change attribute that should always show a visible change, but not
> > +all filesystems implement this properly, causing the NFS server to substitute
> > +the ctime in many cases.
> > +
> > +Multigrain timestamps aim to remedy this by selectively using fine-grained
> > +timestamps when a file has had its timestamps queried recently, and the current
> > +coarse-grained time does not cause a change.
> > +
> > +Inode Timestamps
> > +================
> > +There are currently 3 timestamps in the inode that are updated to the current
> > +wallclock time on different activity:
> > +
> > +ctime:
> > +  The inode change time. This is stamped with the current time whenever
> > +  the inode's metadata is changed. Note that this value is not settable
> > +  from userland.
> > +
> > +mtime:
> > +  The inode modification time. This is stamped with the current time
> > +  any time a file's contents change.
> > +
> > +atime:
> > +  The inode access time. This is stamped whenever an inode's contents are
> > +  read. Widely considered to be a terrible mistake. Usually avoided with
> > +  options like noatime or relatime.
>
> And for btime/crtime (aka creation time) a filesystem can take the
> coarse timestamp, right? It's not settable by userspace, and I think
> statx is the only way those are ever exposed. QUERIED is never set when
> the file is being created.
>

Yep. I'd just copy the ctime to the btime after it's set on creation so
that everything lines up nicely.

> > +Updating the mtime always implies a change to the ctime, but updating the
> > +atime due to a read request does not.
> > +
> > +Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
> > +not affected and always use the coarse-grained value (subject to the floor).
>
> Is it ok if an atime update uses the same timespec as was used for a
> ctime update? There's a pending update for 6.11 that changes
> xfs_trans_ichgtime to do:
>
> 	tv = current_time(inode);
>
> 	if (flags & XFS_ICHGTIME_MOD)
> 		inode_set_mtime_to_ts(inode, tv);
> 	if (flags & XFS_ICHGTIME_CHG)
> 		inode_set_ctime_to_ts(inode, tv);
> 	if (flags & XFS_ICHGTIME_ACCESS)
> 		inode_set_atime_to_ts(inode, tv);
> 	if (flags & XFS_ICHGTIME_CREATE)
> 		ip->i_crtime = tv;
>

Yeah, that should be fine. If you were doing some (hypothetical)
operation that needs to set both the ctime and the atime, then the
natural thing to do is to just let the atime's value "flow" from the
updated ctime.

> So I guess xfs could do something like this to set @tv:
>
> 	if (flags & XFS_ICHGTIME_CHG)
> 		tv = inode_set_ctime_current(inode);
> 	else
> 		tv = current_time();
> 	...
> 	if (flags & XFS_ICHGTIME_ACCESS)
> 		inode_set_atime_to_ts(inode, tv);
>
> Thoughts?
>

Yes, that should be fine. It's pretty similar to what we do in
inode_update_timestamps():

	if (flags & (S_MTIME|S_CTIME|S_VERSION)) {
		...
		now = inode_set_ctime_current(inode);
		...
	} else {
		now = current_time(inode);
	}

In practice, a mtime or version change implies a ctime change, whereas
an atime change generally doesn't. Still, I set up the infrastructure to
handle it properly if the ctime and atime are ever updated together.

> > +Inode Timestamp Ordering
> > +========================
> > +
> > +In addition to just providing info about changes to individual files, file
> > +timestamps also serve an important purpose in applications like "make". These
> > +programs measure timestamps in order to determine whether source files might be
> > +newer than cached objects.
> > +
> > +Userland applications like make can only determine ordering based on
> > +operational boundaries. For a syscall those are the syscall entry and exit
> > +points. For io_uring or nfsd operations, that's the request submission and
> > +response. In the case of concurrent operations, userland can make no
> > +determination about the order in which things will occur.
> > +
> > +For instance, if a single thread modifies one file, and then another file in
> > +sequence, the second file must show an equal or later mtime than the first. The
> > +same is true if two threads are issuing similar operations that do not overlap
> > +in time.
> > +
> > +If however, two threads have racing syscalls that overlap in time, then there
> > +is no such guarantee, and the second file may appear to have been modified
> > +before, after or at the same time as the first, regardless of which one was
> > +submitted first.
> > +
> > +Multigrain Timestamps
> > +=====================
> > +Multigrain timestamps are aimed at ensuring that changes to a single file are
> > +always recognizable, without violating the ordering guarantees when multiple
> > +different files are modified. This affects the mtime and the ctime, but the
> > +atime will always use coarse-grained timestamps.
> > +
> > +It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
> > +or ctime has been queried. If either or both have, then the kernel takes
> > +special care to ensure the next timestamp update will display a visible change.
> > +This ensures tight cache coherency for use-cases like NFS, without sacrificing
> > +the benefits of reduced metadata updates when files aren't being watched.
> > +
> > +The Ctime Floor Value
> > +=====================
> > +It's not sufficient to simply use fine or coarse-grained timestamps based on
> > +whether the mtime or ctime has been queried. A file could get a fine grained
> > +timestamp, and then a second file modified later could get a coarse-grained one
> > +that appears earlier than the first, which would break the kernel's timestamp
> > +ordering guarantees.
> > +
> > +To mitigate this problem, we maintain a global floor value that ensures that
> > +this can't happen. The two files in the above example may appear to have been
> > +modified at the same time in such a case, but they will never show the reverse
> > +order. To avoid problems with realtime clock jumps, the floor is managed as a
> > +monotonic ktime_t, and the values are converted to realtime clock values as
> > +needed.
>
> monotonic atomic64_t?
>

It is an atomic64_t, but the values come from the ktime_get_* functions,
so we use the value as a ktime_t. Both are typedefs of s64 though, so
casting between them is seamless. I'll see if I can make that clearer in
the doc.

> --D
>
> > +
> > +Implementation Notes
> > +====================
> > +Multigrain timestamps are intended for use by local filesystems that get
> > +ctime values from the local clock. This is in contrast to network filesystems
> > +and the like that just mirror timestamp values from a server.
> > +
> > +For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
> > +fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
> > +inode_set_ctime_current(). If the filesystem has a ->getattr routine that
> > +doesn't call generic_fillattr, then you should have it call fill_mg_cmtime to
> > +fill those values.
> >
> > --
> > 2.45.2
> >
> >

Thanks!
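
To make the floor discussion above concrete, here is a small userspace model of a global floor held in a 64-bit atomic and interpreted as monotonic nanoseconds. It is a sketch, not the kernel implementation: it uses C11 `<stdatomic.h>` in place of atomic64_t, the names `model_ktime_t`, `model_ctime_floor`, `model_pick_stamp` and `model_to_realtime` are hypothetical, the coarse and fine clock readings are passed in by the caller rather than read from real clocks, and the handling of a lost compare-and-swap (adopt whatever value the winner installed) is an assumption made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ktime_t: nanoseconds on the monotonic clock. */
typedef int64_t model_ktime_t;

/* Global floor: the latest fine-grained value handed out so far. */
static _Atomic int64_t model_ctime_floor;

/*
 * Pick the timestamp for an update.
 *   old_stamp: the stamp currently on the inode
 *   queried:   whether that stamp has been observed since it was set
 *   coarse:    current coarse-grained clock reading
 *   fine:      current fine-grained clock reading
 */
static model_ktime_t model_pick_stamp(model_ktime_t old_stamp, bool queried,
				      model_ktime_t coarse, model_ktime_t fine)
{
	int64_t cur_floor = atomic_load(&model_ctime_floor);
	/* Never hand out a coarse value that is older than the floor. */
	model_ktime_t now = coarse > cur_floor ? coarse : cur_floor;

	/* Nobody is watching, or the coarse value already shows a change. */
	if (!queried || now > old_stamp)
		return now;

	/* Use the fine-grained value and try to advance the floor to it. */
	if (fine > cur_floor &&
	    atomic_compare_exchange_strong(&model_ctime_floor, &cur_floor, fine))
		return fine;

	/* Lost a race (or fine was not newer): adopt the current floor. */
	return cur_floor;
}

/* Convert a monotonic value to realtime using a caller-supplied offset. */
static int64_t model_to_realtime(model_ktime_t mono, int64_t offset)
{
	return mono + offset;
}
```

In this model, a queried file stamped at fine time 1500 raises the floor to 1500, so a second, unqueried file stamped afterwards while the coarse clock still reads 1000 receives 1500 rather than 1000. The two files can tie, but the second can never appear older than the first.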
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..5cefc204ecec
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp
+inodes. This value is updated on every jiffy, so any change that happens
+within that jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Multigrain Timestamps
+=====================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and then a second file modified later could get a coarse-grained one
+that appears earlier than the first, which would break the kernel's timestamp
+ordering guarantees.
+
+To mitigate this problem, we maintain a global floor value that ensures that
+this can't happen. The two files in the above example may appear to have been
+modified at the same time in such a case, but they will never show the reverse
+order. To avoid problems with realtime clock jumps, the floor is managed as a
+monotonic ktime_t, and the values are converted to realtime clock values as
+needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network filesystems
+and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr, then you should have it call fill_mg_cmtime to
+fill those values.
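
The document says a spare bit in the ctime nanoseconds field records whether the timestamp has been queried. A rough userspace illustration of that idea follows. A valid nanosecond count is always below 1,000,000,000, which fits in 30 bits, so the top bit of a 32-bit field is free to carry a flag. The choice of bit 31, the `MODEL_*` macro names, the `model_inode` struct and the helper functions are all assumptions made for this sketch; the patch text above only states that "an unused bit" is used.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC	UINT32_C(1000000000)
/* Assumption for this sketch: the flag lives in bit 31 of the nsec field. */
#define MODEL_CTIME_QUERIED	(UINT32_C(1) << 31)

/* Every valid nanosecond value must sit below the flag bit. */
_Static_assert(MODEL_NSEC_PER_SEC - 1 < MODEL_CTIME_QUERIED,
	       "flag bit must lie above any valid nanosecond value");

/* Hypothetical inode model holding only the ctime. */
struct model_inode {
	int64_t ctime_sec;
	uint32_t ctime_nsec;	/* nanoseconds, plus the QUERIED flag bit */
};

/* Model of a stat-like query: mark the ctime as seen, return clean nsec. */
static uint32_t model_query_ctime_nsec(struct model_inode *ino)
{
	ino->ctime_nsec |= MODEL_CTIME_QUERIED;
	return ino->ctime_nsec & ~MODEL_CTIME_QUERIED;
}

/* Has anyone looked at the ctime since it was last set? */
static bool model_ctime_was_queried(const struct model_inode *ino)
{
	return (ino->ctime_nsec & MODEL_CTIME_QUERIED) != 0;
}

/* Stamp a new ctime; storing a plain nsec value also clears the flag. */
static void model_set_ctime(struct model_inode *ino, int64_t sec, uint32_t nsec)
{
	ino->ctime_sec = sec;
	ino->ctime_nsec = nsec;
}
```

The point of packing the flag into the same word as the nanoseconds is that the "was this value observed?" state travels with the value itself: writing a new timestamp resets it without touching a second field.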
Add a high-level document that describes how multigrain timestamps work,
rationale for them, and some info about implementation and tradeoffs.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 Documentation/filesystems/multigrain-ts.rst | 120 ++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)