diff mbox series

[RFC,1/3,RESEND] fs: add infrastructure for opportunistic high-res ctime/mtime updates

Message ID 20230411143702.64495-2-jlayton@kernel.org (mailing list archive)
State New
Headers show
Series fs: opportunistic high-res file timestamps | expand

Commit Message

Jeff Layton April 11, 2023, 2:37 p.m. UTC
The VFS always uses coarse-grained timestamp updates for filling out the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away metadata updates.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
lot of exported filesystems don't properly support a change attribute
and are subject to the same problem of timestamp granularity. Other
applications have similar issues (e.g backup applications).

Switching to always using high resolution timestamps would improve the
situation for NFS, but that becomes rather expensive, as we'd have to
log a lot more metadata updates.

This patch grabs a new i_state bit to use as a flag that filesystems can
set in their getattr routine to indicate that the mtime or ctime was
queried since it was last updated.

It then adds a new current_cmtime function that acts like the
current_time helper, but will conditionally grab high-res timestamps
when the i_state flag is set in the inode.

This allows NFS and other applications to reap the benefits of high-res
ctime and mtime timestamps, but at a substantially lower cost than
fetching them every time.

Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++--
 fs/stat.c          | 10 ++++++++++
 include/linux/fs.h |  5 ++++-
 3 files changed, 52 insertions(+), 3 deletions(-)

Comments

Darrick J. Wong April 11, 2023, 2:42 p.m. UTC | #1
On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamp updates for filling out the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away metadata updates.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> lot of exported filesystems don't properly support a change attribute
> and are subject to the same problem of timestamp granularity. Other
> applications have similar issues (e.g backup applications).
> 
> Switching to always using high resolution timestamps would improve the
> situation for NFS, but that becomes rather expensive, as we'd have to
> log a lot more metadata updates.
> 
> This patch grabs a new i_state bit to use as a flag that filesystems can
> set in their getattr routine to indicate that the mtime or ctime was
> queried since it was last updated.
> 
> It then adds a new current_cmtime function that acts like the
> current_time helper, but will conditionally grab high-res timestamps
> when the i_state flag is set in the inode.
> 
> This allows NFS and other applications to reap the benefits of high-res
> ctime and mtime timestamps, but at a substantially lower cost than
> fetching them every time.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++--
>  fs/stat.c          | 10 ++++++++++
>  include/linux/fs.h |  5 ++++-
>  3 files changed, 52 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 4558dc2f1355..3630f67fd042 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2062,6 +2062,42 @@ static int __file_update_time(struct file *file, struct timespec64 *now,
>  	return ret;
>  }
>  
> +/**
> + * current_cmtime - Return FS time (possibly high-res)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime or mtime change. If something recently
> + * fetched the ctime or mtime out of the inode via getattr, then get a
> + * high-resolution timestamp.
> + *
> + * Note that inode and inode->sb cannot be NULL.
> + * Otherwise, the function warns and returns coarse time without truncation.
> + */
> +struct timespec64 current_cmtime(struct inode *inode)
> +{
> +	struct timespec64 now;
> +
> +	if (unlikely(!inode->i_sb)) {
> +		WARN(1, "%s() called with uninitialized super_block in the inode", __func__);
> +		ktime_get_coarse_real_ts64(&now);
> +		return now;
> +	}
> +
> +	/* Do a lockless check for the flag before taking the spinlock */
> +	if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
> +		ktime_get_real_ts64(&now);
> +		spin_lock(&inode->i_lock);
> +		inode->i_state &= ~I_CMTIME_QUERIED;
> +		spin_unlock(&inode->i_lock);
> +	} else {
> +		ktime_get_coarse_real_ts64(&now);
> +	}
> +
> +	return timestamp_truncate(now, inode);

I wonder, under which conditions (arch+fs) would it be worth the effort
to check s_time_gran as part of deciding whether or not to sample a high
res timestamp?

I suppose that would only help us for the situation where "ktime
sampling is not fast" and "fs timestamp granularity is awful"?

(Mechanically, the function body looks ok to me...)

--D

> +}
> +EXPORT_SYMBOL(current_cmtime);
> +
>  /**
>   * file_update_time - update mtime and ctime time
>   * @file: file accessed
> @@ -2080,7 +2116,7 @@ int file_update_time(struct file *file)
>  {
>  	int ret;
>  	struct inode *inode = file_inode(file);
> -	struct timespec64 now = current_time(inode);
> +	struct timespec64 now = current_cmtime(inode);
>  
>  	ret = inode_needs_update_time(inode, &now);
>  	if (ret <= 0)
> @@ -2109,7 +2145,7 @@ static int file_modified_flags(struct file *file, int flags)
>  {
>  	int ret;
>  	struct inode *inode = file_inode(file);
> -	struct timespec64 now = current_time(inode);
> +	struct timespec64 now = current_cmtime(inode);
>  
>  	/*
>  	 * Clear the security bits if the process is not being run by root.
> diff --git a/fs/stat.c b/fs/stat.c
> index 7c238da22ef0..d8b80a2e36b7 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -64,6 +64,16 @@ void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
>  }
>  EXPORT_SYMBOL(generic_fillattr);
>  
> +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
> +{
> +	spin_lock(&inode->i_lock);
> +	inode->i_state |= I_CMTIME_QUERIED;
> +	stat->ctime = inode->i_ctime;
> +	stat->mtime = inode->i_mtime;
> +	spin_unlock(&inode->i_lock);
> +}
> +EXPORT_SYMBOL(fill_cmtime_and_mark);
> +
>  /**
>   * generic_fill_statx_attr - Fill in the statx attributes from the inode flags
>   * @inode:	Inode to use as the source
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c85916e9f7db..7dece4390979 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1457,7 +1457,8 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
>  	       kgid_has_mapping(fs_userns, kgid);
>  }
>  
> -extern struct timespec64 current_time(struct inode *inode);
> +struct timespec64 current_time(struct inode *inode);
> +struct timespec64 current_cmtime(struct inode *inode);
>  
>  /*
>   * Snapshotting support.
> @@ -2116,6 +2117,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>  #define I_DONTCACHE		(1 << 16)
>  #define I_SYNC_QUEUED		(1 << 17)
>  #define I_PINNING_FSCACHE_WB	(1 << 18)
> +#define I_CMTIME_QUERIED	(1 << 19)
>  
>  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> @@ -2839,6 +2841,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
>  extern const struct inode_operations page_symlink_inode_operations;
>  extern void kfree_link(void *);
>  void generic_fillattr(struct mnt_idmap *, struct inode *, struct kstat *);
> +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat);
>  void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
>  extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
>  extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
> -- 
> 2.39.2
>
Jeff Layton April 11, 2023, 2:54 p.m. UTC | #2
On Tue, 2023-04-11 at 07:42 -0700, Darrick J. Wong wrote:
> On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away metadata updates.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problem of timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> > 
> > Switching to always using high resolution timestamps would improve the
> > situation for NFS, but that becomes rather expensive, as we'd have to
> > log a lot more metadata updates.
> > 
> > This patch grabs a new i_state bit to use as a flag that filesystems can
> > set in their getattr routine to indicate that the mtime or ctime was
> > queried since it was last updated.
> > 
> > It then adds a new current_cmtime function that acts like the
> > current_time helper, but will conditionally grab high-res timestamps
> > when the i_state flag is set in the inode.
> > 
> > This allows NFS and other applications to reap the benefits of high-res
> > ctime and mtime timestamps, but at a substantially lower cost than
> > fetching them every time.
> > 
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++--
> >  fs/stat.c          | 10 ++++++++++
> >  include/linux/fs.h |  5 ++++-
> >  3 files changed, 52 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 4558dc2f1355..3630f67fd042 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2062,6 +2062,42 @@ static int __file_update_time(struct file *file, struct timespec64 *now,
> >  	return ret;
> >  }
> >  
> > +/**
> > + * current_cmtime - Return FS time (possibly high-res)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime or mtime change. If something recently
> > + * fetched the ctime or mtime out of the inode via getattr, then get a
> > + * high-resolution timestamp.
> > + *
> > + * Note that inode and inode->sb cannot be NULL.
> > + * Otherwise, the function warns and returns coarse time without truncation.
> > + */
> > +struct timespec64 current_cmtime(struct inode *inode)
> > +{
> > +	struct timespec64 now;
> > +
> > +	if (unlikely(!inode->i_sb)) {
> > +		WARN(1, "%s() called with uninitialized super_block in the inode", __func__);
> > +		ktime_get_coarse_real_ts64(&now);
> > +		return now;
> > +	}
> > +
> > +	/* Do a lockless check for the flag before taking the spinlock */
> > +	if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
> > +		ktime_get_real_ts64(&now);
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state &= ~I_CMTIME_QUERIED;
> > +		spin_unlock(&inode->i_lock);
> > +	} else {
> > +		ktime_get_coarse_real_ts64(&now);
> > +	}
> > +
> > +	return timestamp_truncate(now, inode);
> 
> I wonder, under which conditions (arch+fs) would it be worth the effort
> to check s_time_gran as part of deciding whether or not to sample a high
> res timestamp?
> 
> I suppose that would only help us for the situation where "ktime
> sampling is not fast" and "fs timestamp granularity is awful"?
> 
> (Mechanically, the function body looks ok to me...)
> 

Thanks for looking! Yeah, that is a good point. No reason to fetch a
high res value if we can't store it anyway. That shouldn't be hard to
optimize away. I'll plan to have a look at that.



> > +}
> > +EXPORT_SYMBOL(current_cmtime);
> > +
> >  /**
> >   * file_update_time - update mtime and ctime time
> >   * @file: file accessed
> > @@ -2080,7 +2116,7 @@ int file_update_time(struct file *file)
> >  {
> >  	int ret;
> >  	struct inode *inode = file_inode(file);
> > -	struct timespec64 now = current_time(inode);
> > +	struct timespec64 now = current_cmtime(inode);
> >  
> >  	ret = inode_needs_update_time(inode, &now);
> >  	if (ret <= 0)
> > @@ -2109,7 +2145,7 @@ static int file_modified_flags(struct file *file, int flags)
> >  {
> >  	int ret;
> >  	struct inode *inode = file_inode(file);
> > -	struct timespec64 now = current_time(inode);
> > +	struct timespec64 now = current_cmtime(inode);
> >  
> >  	/*
> >  	 * Clear the security bits if the process is not being run by root.
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 7c238da22ef0..d8b80a2e36b7 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -64,6 +64,16 @@ void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
> >  }
> >  EXPORT_SYMBOL(generic_fillattr);
> >  
> > +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
> > +{
> > +	spin_lock(&inode->i_lock);
> > +	inode->i_state |= I_CMTIME_QUERIED;
> > +	stat->ctime = inode->i_ctime;
> > +	stat->mtime = inode->i_mtime;
> > +	spin_unlock(&inode->i_lock);
> > +}
> > +EXPORT_SYMBOL(fill_cmtime_and_mark);
> > +
> >  /**
> >   * generic_fill_statx_attr - Fill in the statx attributes from the inode flags
> >   * @inode:	Inode to use as the source
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index c85916e9f7db..7dece4390979 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1457,7 +1457,8 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
> >  	       kgid_has_mapping(fs_userns, kgid);
> >  }
> >  
> > -extern struct timespec64 current_time(struct inode *inode);
> > +struct timespec64 current_time(struct inode *inode);
> > +struct timespec64 current_cmtime(struct inode *inode);
> >  
> >  /*
> >   * Snapshotting support.
> > @@ -2116,6 +2117,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
> >  #define I_DONTCACHE		(1 << 16)
> >  #define I_SYNC_QUEUED		(1 << 17)
> >  #define I_PINNING_FSCACHE_WB	(1 << 18)
> > +#define I_CMTIME_QUERIED	(1 << 19)
> >  
> >  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> >  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> > @@ -2839,6 +2841,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
> >  extern const struct inode_operations page_symlink_inode_operations;
> >  extern void kfree_link(void *);
> >  void generic_fillattr(struct mnt_idmap *, struct inode *, struct kstat *);
> > +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat);
> >  void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
> >  extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
> >  extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
> > -- 
> > 2.39.2
> >
Christian Brauner April 11, 2023, 3:07 p.m. UTC | #3
On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamp updates for filling out the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away metadata updates.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> lot of exported filesystems don't properly support a change attribute
> and are subject to the same problem of timestamp granularity. Other
> applications have similar issues (e.g backup applications).
> 
> Switching to always using high resolution timestamps would improve the
> situation for NFS, but that becomes rather expensive, as we'd have to
> log a lot more metadata updates.
> 
> This patch grabs a new i_state bit to use as a flag that filesystems can
> set in their getattr routine to indicate that the mtime or ctime was
> queried since it was last updated.
> 
> It then adds a new current_cmtime function that acts like the
> current_time helper, but will conditionally grab high-res timestamps
> when the i_state flag is set in the inode.
> 
> This allows NFS and other applications to reap the benefits of high-res
> ctime and mtime timestamps, but at a substantially lower cost than
> fetching them every time.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++--
>  fs/stat.c          | 10 ++++++++++
>  include/linux/fs.h |  5 ++++-
>  3 files changed, 52 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 4558dc2f1355..3630f67fd042 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2062,6 +2062,42 @@ static int __file_update_time(struct file *file, struct timespec64 *now,
>  	return ret;
>  }
>  
> +/**
> + * current_cmtime - Return FS time (possibly high-res)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime or mtime change. If something recently
> + * fetched the ctime or mtime out of the inode via getattr, then get a
> + * high-resolution timestamp.
> + *
> + * Note that inode and inode->sb cannot be NULL.
> + * Otherwise, the function warns and returns coarse time without truncation.
> + */
> +struct timespec64 current_cmtime(struct inode *inode)
> +{
> +	struct timespec64 now;
> +
> +	if (unlikely(!inode->i_sb)) {
> +		WARN(1, "%s() called with uninitialized super_block in the inode", __func__);

How would this happen? Seems weird to even bother checking this.

> +		ktime_get_coarse_real_ts64(&now);
> +		return now;
> +	}
> +
> +	/* Do a lockless check for the flag before taking the spinlock */
> +	if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
> +		ktime_get_real_ts64(&now);
> +		spin_lock(&inode->i_lock);
> +		inode->i_state &= ~I_CMTIME_QUERIED;
> +		spin_unlock(&inode->i_lock);
> +	} else {
> +		ktime_get_coarse_real_ts64(&now);
> +	}
> +
> +	return timestamp_truncate(now, inode);
> +}
> +EXPORT_SYMBOL(current_cmtime);
> +
>  /**
>   * file_update_time - update mtime and ctime time
>   * @file: file accessed
> @@ -2080,7 +2116,7 @@ int file_update_time(struct file *file)
>  {
>  	int ret;
>  	struct inode *inode = file_inode(file);
> -	struct timespec64 now = current_time(inode);
> +	struct timespec64 now = current_cmtime(inode);
>  
>  	ret = inode_needs_update_time(inode, &now);
>  	if (ret <= 0)
> @@ -2109,7 +2145,7 @@ static int file_modified_flags(struct file *file, int flags)
>  {
>  	int ret;
>  	struct inode *inode = file_inode(file);
> -	struct timespec64 now = current_time(inode);
> +	struct timespec64 now = current_cmtime(inode);
>  
>  	/*
>  	 * Clear the security bits if the process is not being run by root.
> diff --git a/fs/stat.c b/fs/stat.c
> index 7c238da22ef0..d8b80a2e36b7 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -64,6 +64,16 @@ void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
>  }
>  EXPORT_SYMBOL(generic_fillattr);
>  
> +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
> +{
> +	spin_lock(&inode->i_lock);
> +	inode->i_state |= I_CMTIME_QUERIED;
> +	stat->ctime = inode->i_ctime;
> +	stat->mtime = inode->i_mtime;
> +	spin_unlock(&inode->i_lock);
> +}
> +EXPORT_SYMBOL(fill_cmtime_and_mark);

So that means that each stat call would mark an inode for a
high-resolution update. There's some performance concerns here. Calling
stat() is super common and it would potentially make the next iop more
expensive. Recursively changing ownership in the container use-case come
to mind which are already expensive.
Jeff Layton April 11, 2023, 4:04 p.m. UTC | #4
On Tue, 2023-04-11 at 17:07 +0200, Christian Brauner wrote:
> On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away metadata updates.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problem of timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> > 
> > Switching to always using high resolution timestamps would improve the
> > situation for NFS, but that becomes rather expensive, as we'd have to
> > log a lot more metadata updates.
> > 
> > This patch grabs a new i_state bit to use as a flag that filesystems can
> > set in their getattr routine to indicate that the mtime or ctime was
> > queried since it was last updated.
> > 
> > It then adds a new current_cmtime function that acts like the
> > current_time helper, but will conditionally grab high-res timestamps
> > when the i_state flag is set in the inode.
> > 
> > This allows NFS and other applications to reap the benefits of high-res
> > ctime and mtime timestamps, but at a substantially lower cost than
> > fetching them every time.
> > 
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++--
> >  fs/stat.c          | 10 ++++++++++
> >  include/linux/fs.h |  5 ++++-
> >  3 files changed, 52 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 4558dc2f1355..3630f67fd042 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2062,6 +2062,42 @@ static int __file_update_time(struct file *file, struct timespec64 *now,
> >  	return ret;
> >  }
> >  
> > +/**
> > + * current_cmtime - Return FS time (possibly high-res)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime or mtime change. If something recently
> > + * fetched the ctime or mtime out of the inode via getattr, then get a
> > + * high-resolution timestamp.
> > + *
> > + * Note that inode and inode->sb cannot be NULL.
> > + * Otherwise, the function warns and returns coarse time without truncation.
> > + */
> > +struct timespec64 current_cmtime(struct inode *inode)
> > +{
> > +	struct timespec64 now;
> > +
> > +	if (unlikely(!inode->i_sb)) {
> > +		WARN(1, "%s() called with uninitialized super_block in the inode", __func__);
> 
> How would this happen? Seems weird to even bother checking this.
> 

Agreed. I copied this from current_time. I'm fine with leaving that out.
Maybe we should remove it from current_time as well?

> > +		ktime_get_coarse_real_ts64(&now);
> > +		return now;
> > +	}
> > +
> > +	/* Do a lockless check for the flag before taking the spinlock */
> > +	if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
> > +		ktime_get_real_ts64(&now);
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state &= ~I_CMTIME_QUERIED;
> > +		spin_unlock(&inode->i_lock);
> > +	} else {
> > +		ktime_get_coarse_real_ts64(&now);
> > +	}
> > +
> > +	return timestamp_truncate(now, inode);
> > +}
> > +EXPORT_SYMBOL(current_cmtime);
> > +
> >  /**
> >   * file_update_time - update mtime and ctime time
> >   * @file: file accessed
> > @@ -2080,7 +2116,7 @@ int file_update_time(struct file *file)
> >  {
> >  	int ret;
> >  	struct inode *inode = file_inode(file);
> > -	struct timespec64 now = current_time(inode);
> > +	struct timespec64 now = current_cmtime(inode);
> >  
> >  	ret = inode_needs_update_time(inode, &now);
> >  	if (ret <= 0)
> > @@ -2109,7 +2145,7 @@ static int file_modified_flags(struct file *file, int flags)
> >  {
> >  	int ret;
> >  	struct inode *inode = file_inode(file);
> > -	struct timespec64 now = current_time(inode);
> > +	struct timespec64 now = current_cmtime(inode);
> >  
> >  	/*
> >  	 * Clear the security bits if the process is not being run by root.
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 7c238da22ef0..d8b80a2e36b7 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -64,6 +64,16 @@ void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
> >  }
> >  EXPORT_SYMBOL(generic_fillattr);
> >  
> > +void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
> > +{
> > +	spin_lock(&inode->i_lock);
> > +	inode->i_state |= I_CMTIME_QUERIED;
> > +	stat->ctime = inode->i_ctime;
> > +	stat->mtime = inode->i_mtime;
> > +	spin_unlock(&inode->i_lock);
> > +}
> > +EXPORT_SYMBOL(fill_cmtime_and_mark);
> 
> So that means that each stat call would mark an inode for a
> high-resolution update.
> 

Yep. At least any statx call with STATX_CTIME|STATX_MTIME set (which
includes legacy stat() calls of course).

> There's some performance concerns here. Calling
> stat() is super common and it would potentially make the next iop more
> expensive. Recursively changing ownership in the container use-case come
> to mind which are already expensive.

stat() is common, but not generally as common as write calls are. I
expect that we'll get somewhat similar results tochanged i_version over
to use a similar QUERIED flag.

The i_version field was originally very expensive and required metadata
updates on every write. After making that change, we got the same
performance back in most tests that we got without the i_version field
being enabled at all. Basically, this just means we'll end up logging an
extra journal transaction on some writes that follow a stat() call,
which turns out to be line noise for most workloads.

I do agree that performance is a concern here though. We'll need to
benchmark this somehow.
Jan Kara April 21, 2023, 10:23 a.m. UTC | #5
On Tue 11-04-23 12:04:36, Jeff Layton wrote:
> On Tue, 2023-04-11 at 17:07 +0200, Christian Brauner wrote:
> > On Tue, Apr 11, 2023 at 10:37:00AM -0400, Jeff Layton wrote:
> > There's some performance concerns here. Calling
> > stat() is super common and it would potentially make the next iop more
> > expensive. Recursively changing ownership in the container use-case come
> > to mind which are already expensive.
> 
> stat() is common, but not generally as common as write calls are. I
> expect that we'll get somewhat similar results tochanged i_version over
> to use a similar QUERIED flag.
> 
> The i_version field was originally very expensive and required metadata
> updates on every write. After making that change, we got the same
> performance back in most tests that we got without the i_version field
> being enabled at all. Basically, this just means we'll end up logging an
> extra journal transaction on some writes that follow a stat() call,
> which turns out to be line noise for most workloads.
> 
> I do agree that performance is a concern here though. We'll need to
> benchmark this somehow.

So for stat-intensive read-only workloads the additional inode_lock locking
during stat may be noticeable. I suppose a stress test stating the same
file in a loop from all CPUs the machine has will certainly notice :) But
that's just an unrealistic worst case.

We could check whether the QUERIED flag is already set and if yes, skip the
locking. That should fix the read-only workload case. We just have to think
whether there would not be some unpleasant races created.

								Honza
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f1355..3630f67fd042 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2062,6 +2062,42 @@  static int __file_update_time(struct file *file, struct timespec64 *now,
 	return ret;
 }
 
+/**
+ * current_cmtime - Return FS time (possibly high-res)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime or mtime change. If something recently
+ * fetched the ctime or mtime out of the inode via getattr, then get a
+ * high-resolution timestamp.
+ *
+ * Note that inode and inode->sb cannot be NULL.
+ * Otherwise, the function warns and returns coarse time without truncation.
+ */
+struct timespec64 current_cmtime(struct inode *inode)
+{
+	struct timespec64 now;
+
+	if (unlikely(!inode->i_sb)) {
+		WARN(1, "%s() called with uninitialized super_block in the inode", __func__);
+		ktime_get_coarse_real_ts64(&now);
+		return now;
+	}
+
+	/* Do a lockless check for the flag before taking the spinlock */
+	if (READ_ONCE(inode->i_state) & I_CMTIME_QUERIED) {
+		ktime_get_real_ts64(&now);
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_CMTIME_QUERIED;
+		spin_unlock(&inode->i_lock);
+	} else {
+		ktime_get_coarse_real_ts64(&now);
+	}
+
+	return timestamp_truncate(now, inode);
+}
+EXPORT_SYMBOL(current_cmtime);
+
 /**
  * file_update_time - update mtime and ctime time
  * @file: file accessed
@@ -2080,7 +2116,7 @@  int file_update_time(struct file *file)
 {
 	int ret;
 	struct inode *inode = file_inode(file);
-	struct timespec64 now = current_time(inode);
+	struct timespec64 now = current_cmtime(inode);
 
 	ret = inode_needs_update_time(inode, &now);
 	if (ret <= 0)
@@ -2109,7 +2145,7 @@  static int file_modified_flags(struct file *file, int flags)
 {
 	int ret;
 	struct inode *inode = file_inode(file);
-	struct timespec64 now = current_time(inode);
+	struct timespec64 now = current_cmtime(inode);
 
 	/*
 	 * Clear the security bits if the process is not being run by root.
diff --git a/fs/stat.c b/fs/stat.c
index 7c238da22ef0..d8b80a2e36b7 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -64,6 +64,16 @@  void generic_fillattr(struct mnt_idmap *idmap, struct inode *inode,
 }
 EXPORT_SYMBOL(generic_fillattr);
 
+void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat)
+{
+	spin_lock(&inode->i_lock);
+	inode->i_state |= I_CMTIME_QUERIED;
+	stat->ctime = inode->i_ctime;
+	stat->mtime = inode->i_mtime;
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL(fill_cmtime_and_mark);
+
 /**
  * generic_fill_statx_attr - Fill in the statx attributes from the inode flags
  * @inode:	Inode to use as the source
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..7dece4390979 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1457,7 +1457,8 @@  static inline bool fsuidgid_has_mapping(struct super_block *sb,
 	       kgid_has_mapping(fs_userns, kgid);
 }
 
-extern struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_cmtime(struct inode *inode);
 
 /*
  * Snapshotting support.
@@ -2116,6 +2117,7 @@  static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DONTCACHE		(1 << 16)
 #define I_SYNC_QUEUED		(1 << 17)
 #define I_PINNING_FSCACHE_WB	(1 << 18)
+#define I_CMTIME_QUERIED	(1 << 19)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -2839,6 +2841,7 @@  extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
 void generic_fillattr(struct mnt_idmap *, struct inode *, struct kstat *);
+void fill_cmtime_and_mark(struct inode *inode, struct kstat *stat);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
 extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);