Message ID | 20241001-mgtime-v8-2-903343d91bc3@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | fs: multigrain timestamp redux |
On Tue 01-10-24 06:58:56, Jeff Layton wrote:
> The VFS has always used coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
>
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide when to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g backup
> applications).
>
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
>
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> as a flag that indicates whether the current timestamps have been
> queried via stat() or the like. When it's set, we allow the kernel to
> use a fine-grained timestamp iff it's necessary to make the ctime show
> a different value.
>
> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp. A file that is
> altered just a bit later can then get a coarse-grained one that appears
> older than the earlier fine-grained time. This violates timestamp
> ordering guarantees.
>
> To remedy this, keep a global monotonic atomic64_t value that acts as a
> timestamp floor. When we go to stamp a file, we first get the latter of
> the current floor value and the current coarse-grained time. If the
> inode ctime hasn't been queried then we just attempt to stamp it with
> that value.
>
> If it has been queried, then first see whether the current coarse time
> is later than the existing ctime. If it is, then we accept that value.
> If it isn't, then we get a fine-grained timestamp.
>
> Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> Others should be unaffected (other than being subject to the same floor
> value as multigrain filesystems).
>
> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Mostly looks good. Some smaller comments below.

> +/**
> + * current_time - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp, but don't update
> + * the floor.
> + *
> + * For a multigrain inode, this is effectively an estimate of the timestamp
> + * that a file would receive. An actual update must go through
> + * inode_set_ctime_current().
> + */
> +struct timespec64 current_time(struct inode *inode)
> +{
> +	struct timespec64 now;
> +	u32 cns;
> +
> +	ktime_get_coarse_real_ts64_mg(&now);
> +
> +	if (!is_mgtime(inode))
> +		goto out;
> +
> +	/* If nothing has queried it, then coarse time is fine */
> +	cns = smp_load_acquire(&inode->i_ctime_nsec);
> +	if (cns & I_CTIME_QUERIED) {
> +		/*
> +		 * If there is no apparent change, then get a fine-grained
> +		 * timestamp.
> +		 */
> +		if (now.tv_nsec == (cns & ~I_CTIME_QUERIED))
> +			ktime_get_real_ts64(&now);
> +	}
> +out:
> +	return timestamp_truncate(now, inode);
> +}
> +EXPORT_SYMBOL(current_time);
> +
>  static int inode_needs_update_time(struct inode *inode)
>  {
> +	struct timespec64 now, ts;
>  	int sync_it = 0;
> -	struct timespec64 now = current_time(inode);
> -	struct timespec64 ts;
>
>  	/* First try to exhaust all avenues to not sync */
>  	if (IS_NOCMTIME(inode))
>  		return 0;
>
> +	now = current_time(inode);
> +
>  	ts = inode_get_mtime(inode);
>  	if (!timespec64_equal(&ts, &now))
> -		sync_it = S_MTIME;
> +		sync_it |= S_MTIME;
>
>  	ts = inode_get_ctime(inode);
>  	if (!timespec64_equal(&ts, &now))
> @@ -2598,6 +2637,15 @@ void inode_nohighmem(struct inode *inode)
>  }
>  EXPORT_SYMBOL(inode_nohighmem);
>
> +struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
> +{
> +	set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
> +	inode->i_ctime_sec = ts.tv_sec;
> +	inode->i_ctime_nsec = ts.tv_nsec;
> +	return ts;
> +}
> +EXPORT_SYMBOL(inode_set_ctime_to_ts);
> +
>  /**
>   * timestamp_truncate - Truncate timespec to a granularity
>   * @t: Timespec
> @@ -2630,36 +2678,75 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
>  EXPORT_SYMBOL(timestamp_truncate);
>
>  /**
> - * current_time - Return FS time
> - * @inode: inode.
> + * inode_set_ctime_current - set the ctime to current_time
> + * @inode: inode
>   *
> - * Return the current time truncated to the time granularity supported by
> - * the fs.
> + * Set the inode's ctime to the current value for the inode. Returns the
> + * current value that was assigned. If this is not a multigrain inode, then we
> + * set it to the later of the coarse time and floor value.
>   *
> - * Note that inode and inode->sb cannot be NULL.
> - * Otherwise, the function warns and returns time without truncation.
> + * If it is multigrain, then we first see if the coarse-grained timestamp is
> + * distinct from what we have. If so, then we'll just use that. If we have to
> + * get a fine-grained timestamp, then do so, and try to swap it into the floor.
> + * We accept the new floor value regardless of the outcome of the cmpxchg.
> + * After that, we try to swap the new value into i_ctime_nsec. Again, we take
> + * the resulting ctime, regardless of the outcome of the swap.

This comment seems outdated now. No floor in this function anymore...

> -struct timespec64 current_time(struct inode *inode)
> +struct timespec64 inode_set_ctime_current(struct inode *inode)
>  {
>  	struct timespec64 now;
> +	u32 cns, cur;

...

> diff --git a/fs/stat.c b/fs/stat.c
> index 41e598376d7e..381926fb405f 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -26,6 +26,35 @@
>  #include "internal.h"
>  #include "mount.h"
>
> +/**
> + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> + * @stat: where to store the resulting values
> + * @request_mask: STATX_* values requested
> + * @inode: inode from which to grab the c/mtime
> + *
> + * Given @inode, grab the ctime and mtime out if it and store the result
                                                 ^^ of
> + * in @stat. When fetching the value, flag it as QUERIED (if not already)
> + * so the next write will record a distinct timestamp.
> + */
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> +{

Given how things worked out in the end, it seems this function doesn't need
to handle mtime at all and we can move mtime handling back to shared generic
code?

> +	atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
> +
> +	/* If neither time was requested, then don't report them */
> +	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> +		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> +		return;
> +	}
> +
> +	stat->mtime = inode_get_mtime(inode);
> +	stat->ctime.tv_sec = inode->i_ctime_sec;
> +	stat->ctime.tv_nsec = (u32)atomic_read(pcn);
> +	if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
> +		stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
> +	stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
> +}
> +EXPORT_SYMBOL(fill_mg_cmtime);
> +
>  /**
>   * generic_fillattr - Fill in the basic attributes from the inode struct
>   * @idmap: idmap of the mount the inode was found from
> @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
>  	stat->rdev = inode->i_rdev;
>  	stat->size = i_size_read(inode);
>  	stat->atime = inode_get_atime(inode);
> -	stat->mtime = inode_get_mtime(inode);
> -	stat->ctime = inode_get_ctime(inode);
> +
> +	if (is_mgtime(inode)) {
> +		fill_mg_cmtime(stat, request_mask, inode);
> +	} else {
> +		stat->ctime = inode_get_ctime(inode);
> +		stat->mtime = inode_get_mtime(inode);
> +	}
> +
>  	stat->blksize = i_blocksize(inode);
>  	stat->blocks = inode->i_blocks;
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e3c603d01337..23908bad166c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1653,6 +1653,17 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
>  	return inode_set_mtime_to_ts(inode, ts);
>  }
>
> +/*
> + * Multigrain timestamps
> + *
> + * Conditionally use fine-grained ctime and mtime timestamps when there
> + * are users actively observing them via getattr. The primary use-case
> + * for this is NFS clients that use the ctime to distinguish between
> + * different states of the file, and that are often fooled by multiple
> + * operations that occur in the same coarse-grained timer tick.

Again, mtime seems unaffected by mgtime changes now.

> + */
> +#define I_CTIME_QUERIED	((u32)BIT(31))
> +

Honza
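
For readers following the fill_mg_cmtime() discussion, here is a minimal userspace sketch of the flag-in-the-nsec-field trick, assuming C11 <stdatomic.h> in place of the kernel's atomic_t helpers. The names TOY_QUERIED, toy_query_nsec() and toy_was_queried() are made up for illustration and are not part of the patch. It relies only on the fact that a valid nanosecond count is below 1,000,000,000 and so never uses bit 31.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for I_CTIME_QUERIED: bit 31 of a 32-bit nsec field. */
#define TOY_QUERIED (UINT32_C(1) << 31)

/* Report the stored nanoseconds and mark the value as having been observed. */
static uint32_t toy_query_nsec(_Atomic uint32_t *nsec)
{
	uint32_t v = atomic_load(nsec);

	/* Skip the read-modify-write when the flag is already set. */
	if (!(v & TOY_QUERIED))
		v = atomic_fetch_or(nsec, TOY_QUERIED);

	/* Callers never see the flag bit, only the nanoseconds. */
	v &= ~TOY_QUERIED;
	assert(v < 1000000000u);
	return v;
}

/* True once some reader has looked at the current value. */
static bool toy_was_queried(_Atomic uint32_t *nsec)
{
	return (atomic_load(nsec) & TOY_QUERIED) != 0;
}
```

Calling toy_query_nsec() twice on the same value returns the same nanoseconds both times; only the first call pays for the atomic OR.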
On Tue, 2024-10-01 at 15:20 +0200, Jan Kara wrote:
> On Tue 01-10-24 06:58:56, Jeff Layton wrote:
> > The VFS has always used coarse-grained timestamps when updating the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> >
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide when to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > applications have similar issues with timestamps (e.g backup
> > applications).
> >
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> >
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> > as a flag that indicates whether the current timestamps have been
> > queried via stat() or the like. When it's set, we allow the kernel to
> > use a fine-grained timestamp iff it's necessary to make the ctime show
> > a different value.
> >
> > This solves the problem of being able to distinguish the timestamp
> > between updates, but introduces a new problem: it's now possible for a
> > file being changed to get a fine-grained timestamp. A file that is
> > altered just a bit later can then get a coarse-grained one that appears
> > older than the earlier fine-grained time. This violates timestamp
> > ordering guarantees.
> >
> > To remedy this, keep a global monotonic atomic64_t value that acts as a
> > timestamp floor. When we go to stamp a file, we first get the latter of
> > the current floor value and the current coarse-grained time. If the
> > inode ctime hasn't been queried then we just attempt to stamp it with
> > that value.
> >
> > If it has been queried, then first see whether the current coarse time
> > is later than the existing ctime. If it is, then we accept that value.
> > If it isn't, then we get a fine-grained timestamp.
> >
> > Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> > Others should be unaffected (other than being subject to the same floor
> > value as multigrain filesystems).
> >
> > Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>
> Mostly looks good. Some smaller comments below.
>
> > +/**
> > + * current_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp, but don't update
> > + * the floor.
> > + *
> > + * For a multigrain inode, this is effectively an estimate of the timestamp
> > + * that a file would receive. An actual update must go through
> > + * inode_set_ctime_current().
> > + */
> > +struct timespec64 current_time(struct inode *inode)
> > +{
> > +	struct timespec64 now;
> > +	u32 cns;
> > +
> > +	ktime_get_coarse_real_ts64_mg(&now);
> > +
> > +	if (!is_mgtime(inode))
> > +		goto out;
> > +
> > +	/* If nothing has queried it, then coarse time is fine */
> > +	cns = smp_load_acquire(&inode->i_ctime_nsec);
> > +	if (cns & I_CTIME_QUERIED) {
> > +		/*
> > +		 * If there is no apparent change, then get a fine-grained
> > +		 * timestamp.
> > +		 */
> > +		if (now.tv_nsec == (cns & ~I_CTIME_QUERIED))
> > +			ktime_get_real_ts64(&now);
> > +	}
> > +out:
> > +	return timestamp_truncate(now, inode);
> > +}
> > +EXPORT_SYMBOL(current_time);
> > +
> >  static int inode_needs_update_time(struct inode *inode)
> >  {
> > +	struct timespec64 now, ts;
> >  	int sync_it = 0;
> > -	struct timespec64 now = current_time(inode);
> > -	struct timespec64 ts;
> >
> >  	/* First try to exhaust all avenues to not sync */
> >  	if (IS_NOCMTIME(inode))
> >  		return 0;
> >
> > +	now = current_time(inode);
> > +
> >  	ts = inode_get_mtime(inode);
> >  	if (!timespec64_equal(&ts, &now))
> > -		sync_it = S_MTIME;
> > +		sync_it |= S_MTIME;
> >
> >  	ts = inode_get_ctime(inode);
> >  	if (!timespec64_equal(&ts, &now))
> > @@ -2598,6 +2637,15 @@ void inode_nohighmem(struct inode *inode)
> >  }
> >  EXPORT_SYMBOL(inode_nohighmem);
> >
> > +struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
> > +{
> > +	set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
> > +	inode->i_ctime_sec = ts.tv_sec;
> > +	inode->i_ctime_nsec = ts.tv_nsec;
> > +	return ts;
> > +}
> > +EXPORT_SYMBOL(inode_set_ctime_to_ts);
> > +
> >  /**
> >   * timestamp_truncate - Truncate timespec to a granularity
> >   * @t: Timespec
> > @@ -2630,36 +2678,75 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
> >  EXPORT_SYMBOL(timestamp_truncate);
> >
> >  /**
> > - * current_time - Return FS time
> > - * @inode: inode.
> > + * inode_set_ctime_current - set the ctime to current_time
> > + * @inode: inode
> >   *
> > - * Return the current time truncated to the time granularity supported by
> > - * the fs.
> > + * Set the inode's ctime to the current value for the inode. Returns the
> > + * current value that was assigned. If this is not a multigrain inode, then we
> > + * set it to the later of the coarse time and floor value.
> >   *
> > - * Note that inode and inode->sb cannot be NULL.
> > - * Otherwise, the function warns and returns time without truncation.
> > + * If it is multigrain, then we first see if the coarse-grained timestamp is
> > + * distinct from what we have. If so, then we'll just use that. If we have to
> > + * get a fine-grained timestamp, then do so, and try to swap it into the floor.
> > + * We accept the new floor value regardless of the outcome of the cmpxchg.
> > + * After that, we try to swap the new value into i_ctime_nsec. Again, we take
> > + * the resulting ctime, regardless of the outcome of the swap.
>
> This comment seems outdated now. No floor in this function anymore...
>

True. Will fix.

> > -struct timespec64 current_time(struct inode *inode)
> > +struct timespec64 inode_set_ctime_current(struct inode *inode)
> >  {
> >  	struct timespec64 now;
> > +	u32 cns, cur;
> ...
>
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 41e598376d7e..381926fb405f 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -26,6 +26,35 @@
> >  #include "internal.h"
> >  #include "mount.h"
> >
> > +/**
> > + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> > + * @stat: where to store the resulting values
> > + * @request_mask: STATX_* values requested
> > + * @inode: inode from which to grab the c/mtime
> > + *
> > + * Given @inode, grab the ctime and mtime out if it and store the result
>                                                  ^^ of
>
> > + * in @stat. When fetching the value, flag it as QUERIED (if not already)
> > + * so the next write will record a distinct timestamp.
> > + */
> > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> > +{
>
> Given how things worked out in the end, it seems this function doesn't need
> to handle mtime at all and we can move mtime handling back to shared generic
> code?
>

I don't think we can. The mtime is effectively derived from the ctime.

If I query only the mtime, I think it's reasonable to expect that it
will change if there is another write, even if I don't query the ctime.
We won't get that unless we can also set the flag in the ctime when
only the mtime is requested.

> > +	atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
> > +
> > +	/* If neither time was requested, then don't report them */
> > +	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> > +		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> > +		return;
> > +	}
> > +
> > +	stat->mtime = inode_get_mtime(inode);
> > +	stat->ctime.tv_sec = inode->i_ctime_sec;
> > +	stat->ctime.tv_nsec = (u32)atomic_read(pcn);
> > +	if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
> > +		stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
> > +	stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
> > +}
> > +EXPORT_SYMBOL(fill_mg_cmtime);
> > +
> >  /**
> >   * generic_fillattr - Fill in the basic attributes from the inode struct
> >   * @idmap: idmap of the mount the inode was found from
> > @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
> >  	stat->rdev = inode->i_rdev;
> >  	stat->size = i_size_read(inode);
> >  	stat->atime = inode_get_atime(inode);
> > -	stat->mtime = inode_get_mtime(inode);
> > -	stat->ctime = inode_get_ctime(inode);
> > +
> > +	if (is_mgtime(inode)) {
> > +		fill_mg_cmtime(stat, request_mask, inode);
> > +	} else {
> > +		stat->ctime = inode_get_ctime(inode);
> > +		stat->mtime = inode_get_mtime(inode);
> > +	}
> > +
> >  	stat->blksize = i_blocksize(inode);
> >  	stat->blocks = inode->i_blocks;
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e3c603d01337..23908bad166c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1653,6 +1653,17 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
> >  	return inode_set_mtime_to_ts(inode, ts);
> >  }
> >
> > +/*
> > + * Multigrain timestamps
> > + *
> > + * Conditionally use fine-grained ctime and mtime timestamps when there
> > + * are users actively observing them via getattr. The primary use-case
> > + * for this is NFS clients that use the ctime to distinguish between
> > + * different states of the file, and that are often fooled by multiple
> > + * operations that occur in the same coarse-grained timer tick.
>
> Again, mtime seems unaffected by mgtime changes now.
>

I still think we need this.

> > + */
> > +#define I_CTIME_QUERIED	((u32)BIT(31))
> > +
>
> Honza
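
Jeff's point about mtime-only queries can be seen in a small single-threaded toy model. This is only a sketch under loud assumptions: it tracks nanoseconds alone (no seconds), takes the coarse and fine clock readings as parameters, and uses invented names (struct toy_inode, toy_stat_mtime(), toy_write(), TOY_WANT_*) rather than anything from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_QUERIED    (UINT32_C(1) << 31)
#define TOY_WANT_CTIME 0x1u	/* hypothetical request bits, not STATX_* */
#define TOY_WANT_MTIME 0x2u

struct toy_inode {
	uint32_t ctime_nsec;	/* bit 31 doubles as the QUERIED flag */
	uint32_t mtime_nsec;
};

/* stat(): asking for either time marks the ctime as QUERIED. */
static uint32_t toy_stat_mtime(struct toy_inode *ino, unsigned int request)
{
	if (request & (TOY_WANT_CTIME | TOY_WANT_MTIME))
		ino->ctime_nsec |= TOY_QUERIED;
	return ino->mtime_nsec;
}

/* write: choose one timestamp, then stamp both ctime and mtime with it. */
static void toy_write(struct toy_inode *ino, uint32_t coarse, uint32_t fine)
{
	uint32_t old = ino->ctime_nsec & ~TOY_QUERIED;
	uint32_t now = coarse;

	/*
	 * Only use the fine-grained reading if someone looked at the old
	 * value and the coarse clock would not show a change.
	 */
	if ((ino->ctime_nsec & TOY_QUERIED) && coarse <= old)
		now = fine;

	ino->ctime_nsec = now;	/* storing a fresh value clears QUERIED */
	ino->mtime_nsec = now;
}
```

In this model a write within the same coarse tick leaves mtime unchanged when nobody has looked, but produces a distinct mtime after an mtime-only toy_stat_mtime() call, because that call set the flag that lives in the ctime field.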
On Tue 01-10-24 09:34:18, Jeff Layton wrote:
> On Tue, 2024-10-01 at 15:20 +0200, Jan Kara wrote:
> > > diff --git a/fs/stat.c b/fs/stat.c
> > > index 41e598376d7e..381926fb405f 100644
> > > --- a/fs/stat.c
> > > +++ b/fs/stat.c
> > > @@ -26,6 +26,35 @@
> > >  #include "internal.h"
> > >  #include "mount.h"
> > >
> > > +/**
> > > + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> > > + * @stat: where to store the resulting values
> > > + * @request_mask: STATX_* values requested
> > > + * @inode: inode from which to grab the c/mtime
> > > + *
> > > + * Given @inode, grab the ctime and mtime out if it and store the result
> >                                                  ^^ of
> >
> > > + * in @stat. When fetching the value, flag it as QUERIED (if not already)
> > > + * so the next write will record a distinct timestamp.
> > > + */
> > > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> > > +{
> >
> > Given how things worked out in the end, it seems this function doesn't need
> > to handle mtime at all and we can move mtime handling back to shared generic
> > code?
> >
>
> I don't think we can. The mtime is effectively derived from the ctime.
>
> If I query only the mtime, I think it's reasonable to expect that it
> will change if there is another write, even if I don't query the ctime.
> We won't get that unless we can also set the flag in the ctime when
> only the mtime is requested.

Aha, right. I already forgot about this :). Can you please add to the
comment the above explanation so that we remember next time somebody wants
to "clean this up" like me ;)? Thanks!

Also feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Honza
On Wed, 2024-10-02 at 11:14 +0200, Jan Kara wrote:
> On Tue 01-10-24 09:34:18, Jeff Layton wrote:
> > On Tue, 2024-10-01 at 15:20 +0200, Jan Kara wrote:
> > > > diff --git a/fs/stat.c b/fs/stat.c
> > > > index 41e598376d7e..381926fb405f 100644
> > > > --- a/fs/stat.c
> > > > +++ b/fs/stat.c
> > > > @@ -26,6 +26,35 @@
> > > >  #include "internal.h"
> > > >  #include "mount.h"
> > > >
> > > > +/**
> > > > + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> > > > + * @stat: where to store the resulting values
> > > > + * @request_mask: STATX_* values requested
> > > > + * @inode: inode from which to grab the c/mtime
> > > > + *
> > > > + * Given @inode, grab the ctime and mtime out if it and store the result
> > >                                                  ^^ of
> > >
> > > > + * in @stat. When fetching the value, flag it as QUERIED (if not already)
> > > > + * so the next write will record a distinct timestamp.
> > > > + */
> > > > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> > > > +{
> > >
> > > Given how things worked out in the end, it seems this function doesn't need
> > > to handle mtime at all and we can move mtime handling back to shared generic
> > > code?
> > >
> >
> > I don't think we can. The mtime is effectively derived from the ctime.
> >
> > If I query only the mtime, I think it's reasonable to expect that it
> > will change if there is another write, even if I don't query the ctime.
> > We won't get that unless we can also set the flag in the ctime when
> > only the mtime is requested.
>
> Aha, right. I already forgot about this :). Can you please add to the
> comment the above explanation so that we remember next time somebody wants
> to "clean this up" like me ;)? Thanks!
>
> Also feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Honza

Done, and the revised version is in my tree. Thanks for the review!
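
As background for the "floor" that the commit message describes (and that Jan notes is no longer handled inside inode_set_ctime_current() itself), here is a hedged userspace sketch of the idea. It assumes C11 atomics and timestamps expressed as plain nanosecond counts; toy_floor, toy_coarse_stamp() and toy_fine_stamp() are illustrative names, and this is not the kernel's timekeeping code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Latest fine-grained time handed out so far, in nanoseconds. */
static _Atomic int64_t toy_floor;

/* Coarse stamp: the later of the coarse clock and the floor. */
static int64_t toy_coarse_stamp(int64_t coarse_now)
{
	int64_t floor = atomic_load(&toy_floor);

	return coarse_now > floor ? coarse_now : floor;
}

/* Fine stamp: try once to advance the floor, accept whatever results. */
static int64_t toy_fine_stamp(int64_t fine_now)
{
	int64_t old = atomic_load(&toy_floor);

	/* The floor never moves backwards. */
	if (fine_now <= old)
		return old;

	if (atomic_compare_exchange_strong(&toy_floor, &old, fine_now))
		return fine_now;

	/* Lost the race: old now holds the value the winner installed. */
	return old;
}
```

With the floor in place, a coarse stamp taken after a fine-grained one can no longer appear older than it, which is the ordering problem the commit message sets out to avoid.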
diff --git a/fs/inode.c b/fs/inode.c
index 471ae4a31549..4ec1e71e9a9d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2209,19 +2209,58 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);

+/**
+ * current_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp, but don't update
+ * the floor.
+ *
+ * For a multigrain inode, this is effectively an estimate of the timestamp
+ * that a file would receive. An actual update must go through
+ * inode_set_ctime_current().
+ */
+struct timespec64 current_time(struct inode *inode)
+{
+	struct timespec64 now;
+	u32 cns;
+
+	ktime_get_coarse_real_ts64_mg(&now);
+
+	if (!is_mgtime(inode))
+		goto out;
+
+	/* If nothing has queried it, then coarse time is fine */
+	cns = smp_load_acquire(&inode->i_ctime_nsec);
+	if (cns & I_CTIME_QUERIED) {
+		/*
+		 * If there is no apparent change, then get a fine-grained
+		 * timestamp.
+		 */
+		if (now.tv_nsec == (cns & ~I_CTIME_QUERIED))
+			ktime_get_real_ts64(&now);
+	}
+out:
+	return timestamp_truncate(now, inode);
+}
+EXPORT_SYMBOL(current_time);
+
 static int inode_needs_update_time(struct inode *inode)
 {
+	struct timespec64 now, ts;
 	int sync_it = 0;
-	struct timespec64 now = current_time(inode);
-	struct timespec64 ts;

 	/* First try to exhaust all avenues to not sync */
 	if (IS_NOCMTIME(inode))
 		return 0;

+	now = current_time(inode);
+
 	ts = inode_get_mtime(inode);
 	if (!timespec64_equal(&ts, &now))
-		sync_it = S_MTIME;
+		sync_it |= S_MTIME;

 	ts = inode_get_ctime(inode);
 	if (!timespec64_equal(&ts, &now))
@@ -2598,6 +2637,15 @@ void inode_nohighmem(struct inode *inode)
 }
 EXPORT_SYMBOL(inode_nohighmem);

+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
+{
+	set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
+	inode->i_ctime_sec = ts.tv_sec;
+	inode->i_ctime_nsec = ts.tv_nsec;
+	return ts;
+}
+EXPORT_SYMBOL(inode_set_ctime_to_ts);
+
 /**
  * timestamp_truncate - Truncate timespec to a granularity
  * @t: Timespec
@@ -2630,36 +2678,75 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
 EXPORT_SYMBOL(timestamp_truncate);

 /**
- * current_time - Return FS time
- * @inode: inode.
+ * inode_set_ctime_current - set the ctime to current_time
+ * @inode: inode
  *
- * Return the current time truncated to the time granularity supported by
- * the fs.
+ * Set the inode's ctime to the current value for the inode. Returns the
+ * current value that was assigned. If this is not a multigrain inode, then we
+ * set it to the later of the coarse time and floor value.
  *
- * Note that inode and inode->sb cannot be NULL.
- * Otherwise, the function warns and returns time without truncation.
+ * If it is multigrain, then we first see if the coarse-grained timestamp is
+ * distinct from what we have. If so, then we'll just use that. If we have to
+ * get a fine-grained timestamp, then do so, and try to swap it into the floor.
+ * We accept the new floor value regardless of the outcome of the cmpxchg.
+ * After that, we try to swap the new value into i_ctime_nsec. Again, we take
+ * the resulting ctime, regardless of the outcome of the swap.
  */
-struct timespec64 current_time(struct inode *inode)
+struct timespec64 inode_set_ctime_current(struct inode *inode)
 {
 	struct timespec64 now;
+	u32 cns, cur;

-	ktime_get_coarse_real_ts64(&now);
-	return timestamp_truncate(now, inode);
-}
-EXPORT_SYMBOL(current_time);
+	ktime_get_coarse_real_ts64_mg(&now);
+	now = timestamp_truncate(now, inode);

-/**
- * inode_set_ctime_current - set the ctime to current_time
- * @inode: inode
- *
- * Set the inode->i_ctime to the current value for the inode. Returns
- * the current value that was assigned to i_ctime.
- */
-struct timespec64 inode_set_ctime_current(struct inode *inode)
-{
-	struct timespec64 now = current_time(inode);
+	/* Just return that if this is not a multigrain fs */
+	if (!is_mgtime(inode)) {
+		inode_set_ctime_to_ts(inode, now);
+		goto out;
+	}

-	inode_set_ctime_to_ts(inode, now);
+	/*
+	 * We only need a fine-grained time if someone has queried it,
+	 * and the current coarse grained time isn't later than what's
+	 * already there.
+	 */
+	cns = smp_load_acquire(&inode->i_ctime_nsec);
+	if (cns & I_CTIME_QUERIED) {
+		struct timespec64 ctime = { .tv_sec = inode->i_ctime_sec,
+					    .tv_nsec = cns & ~I_CTIME_QUERIED };
+
+		if (timespec64_compare(&now, &ctime) <= 0) {
+			ktime_get_real_ts64_mg(&now);
+			now = timestamp_truncate(now, inode);
+		}
+	}
+
+	/* No need to cmpxchg if it's exactly the same */
+	if (cns == now.tv_nsec && inode->i_ctime_sec == now.tv_sec)
+		goto out;
+	cur = cns;
+retry:
+	/* Try to swap the nsec value into place. */
+	if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) {
+		/* If swap occurred, then we're (mostly) done */
+		inode->i_ctime_sec = now.tv_sec;
+	} else {
+		/*
+		 * Was the change due to someone marking the old ctime QUERIED?
+		 * If so then retry the swap. This can only happen once since
+		 * the only way to clear I_CTIME_QUERIED is to stamp the inode
+		 * with a new ctime.
+		 */
+		if (!(cns & I_CTIME_QUERIED) && (cns | I_CTIME_QUERIED) == cur) {
+			cns = cur;
+			goto retry;
+		}
+		/* Otherwise, keep the existing ctime */
+		now.tv_sec = inode->i_ctime_sec;
+		now.tv_nsec = cur & ~I_CTIME_QUERIED;
+	}
+out:
 	return now;
 }
 EXPORT_SYMBOL(inode_set_ctime_current);
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e..381926fb405f 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -26,6 +26,35 @@
 #include "internal.h"
 #include "mount.h"

+/**
+ * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
+ * @stat: where to store the resulting values
+ * @request_mask: STATX_* values requested
+ * @inode: inode from which to grab the c/mtime
+ *
+ * Given @inode, grab the ctime and mtime out if it and store the result
+ * in @stat. When fetching the value, flag it as QUERIED (if not already)
+ * so the next write will record a distinct timestamp.
+ */
+void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
+{
+	atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
+
+	/* If neither time was requested, then don't report them */
+	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
+		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
+		return;
+	}
+
+	stat->mtime = inode_get_mtime(inode);
+	stat->ctime.tv_sec = inode->i_ctime_sec;
+	stat->ctime.tv_nsec = (u32)atomic_read(pcn);
+	if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
+		stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
+	stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
+}
+EXPORT_SYMBOL(fill_mg_cmtime);
+
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
  * @idmap: idmap of the mount the inode was found from
@@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
 	stat->rdev = inode->i_rdev;
 	stat->size = i_size_read(inode);
 	stat->atime = inode_get_atime(inode);
-	stat->mtime = inode_get_mtime(inode);
-	stat->ctime = inode_get_ctime(inode);
+
+	if (is_mgtime(inode)) {
+		fill_mg_cmtime(stat, request_mask, inode);
+	} else {
+		stat->ctime = inode_get_ctime(inode);
+		stat->mtime = inode_get_mtime(inode);
+	}
+
 	stat->blksize = i_blocksize(inode);
 	stat->blocks = inode->i_blocks;

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..23908bad166c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1653,6 +1653,17 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
 	return inode_set_mtime_to_ts(inode, ts);
 }

+/*
+ * Multigrain timestamps
+ *
+ * Conditionally use fine-grained ctime and mtime timestamps when there
+ * are users actively observing them via getattr. The primary use-case
+ * for this is NFS clients that use the ctime to distinguish between
+ * different states of the file, and that are often fooled by multiple
+ * operations that occur in the same coarse-grained timer tick.
+ */
+#define I_CTIME_QUERIED	((u32)BIT(31))
+
 static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
 	return inode->i_ctime_sec;
@@ -1660,7 +1671,7 @@ static inline time64_t inode_get_ctime_sec(const struct inode *inode)

 static inline long inode_get_ctime_nsec(const struct inode *inode)
 {
-	return inode->i_ctime_nsec;
+	return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
 }

 static inline struct timespec64 inode_get_ctime(const struct inode *inode)
@@ -1671,13 +1682,7 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 	return ts;
 }

-static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
-						      struct timespec64 ts)
-{
-	inode->i_ctime_sec = ts.tv_sec;
-	inode->i_ctime_nsec = ts.tv_nsec;
-	return ts;
-}
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);

 /**
  * inode_set_ctime - set the ctime in the inode
@@ -2542,6 +2547,7 @@ struct file_system_type {
 #define FS_USERNS_MOUNT	8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
 #define FS_ALLOW_IDMAP	32	/* FS has been updated to handle vfs idmappings. */
+#define FS_MGTIME	64	/* FS uses multigrain timestamps */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
@@ -2565,6 +2571,17 @@ struct file_system_type {

 #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)

+/**
+ * is_mgtime: is this inode using multigrain timestamps
+ * @inode: inode to test for multigrain timestamps
+ *
+ * Return true if the inode uses multigrain timestamps, false otherwise.
+ */
+static inline bool is_mgtime(const struct inode *inode)
+{
+	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
+}
+
 extern struct dentry *mount_bdev(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data,
 	int (*fill_super)(struct super_block *, void *, int));
@@ -3321,6 +3338,7 @@ extern void page_put_link(void *);
 extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
+void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
 void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 void generic_fill_statx_atomic_writes(struct kstat *stat,
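
The retry-once logic around try_cmpxchg() in inode_set_ctime_current() can be modelled in userspace roughly as follows. This is a sketch only: it assumes C11 atomic_compare_exchange_strong() as a stand-in for try_cmpxchg(), handles the nanosecond word alone, and takes the earlier "seen" value as a parameter so that a stale read can be simulated without threads. TOY_QUERIED and toy_swap_nsec() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_QUERIED (UINT32_C(1) << 31)

/*
 * Try to install @now in @slot, given that the caller earlier read @seen.
 * Returns the nanosecond value that ends up being reported.
 */
static uint32_t toy_swap_nsec(_Atomic uint32_t *slot, uint32_t seen, uint32_t now)
{
	uint32_t cur = seen;

retry:
	/* On failure, cur is updated to the value currently in the slot. */
	if (atomic_compare_exchange_strong(slot, &cur, now))
		return now;

	/*
	 * Did the slot differ only because a reader set QUERIED on the value
	 * we saw? Then retry once against the flagged value. A second such
	 * failure cannot occur, since the flag test below is then false.
	 */
	if (!(seen & TOY_QUERIED) && (seen | TOY_QUERIED) == cur) {
		seen = cur;
		goto retry;
	}

	/* Someone else stamped a different time: keep theirs. */
	return cur & ~TOY_QUERIED;
}
```

Passing a stale seen value of 1000 while the slot holds 1000 with the flag set exercises the retry path; passing it while the slot holds an unrelated value exercises the "keep the existing ctime" path.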