diff mbox

[v4,19/19] fs: handle inode->i_version more efficiently

Message ID 20171222120556.7435-20-jlayton@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Jeff Layton Dec. 22, 2017, 12:05 p.m. UTC
From: Jeff Layton <jlayton@redhat.com>

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit.  If it's clear then we don't need to do anything if the update
isn't being forced.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/fs.h       |   2 +-
 include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 154 insertions(+), 56 deletions(-)

Comments

Jan Kara Jan. 2, 2018, 5 p.m. UTC | #1
On Fri 22-12-17 07:05:56, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fs.h       |   2 +-
>  include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 154 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 76382c24e9d0..6804d075933e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -639,7 +639,7 @@ struct inode {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
>  	};
> -	u64			i_version;
> +	atomic64_t		i_version;
>  	atomic_t		i_count;
>  	atomic_t		i_dio_count;
>  	atomic_t		i_writecount;
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e08c634779df..cef242e54489 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -5,6 +5,8 @@
>  #include <linux/fs.h>
>  
>  /*
> + * The inode->i_version field:
> + * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
>   * appear different to observers if there was a change to the inode's data or
> @@ -27,86 +29,171 @@
>   * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
>   * We consider these sorts of filesystems to have a kernel-managed i_version.
>   *
> + * This implementation uses the low bit in the i_version field as a flag to
> + * track when the value has been queried. If it has not been queried since it
> + * was last incremented, we can skip the increment in most cases.
> + *
> + * In the event that we're updating the ctime, we will usually go ahead and
> + * bump the i_version anyway. Since that has to go to stable storage in some
> + * fashion, we might as well increment it as well.
> + *
> + * With this implementation, the value should always appear to observers to
> + * increase over time if the file has changed. It's recommended to use
> + * inode_cmp_iversion() helper to compare values.
> + *
>   * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> - * a server-provided value (for the most part). For that reason, those
> + * a server-provided value for the most part. For that reason, those
>   * filesystems do not set SB_I_VERSION. These filesystems are considered to
>   * have a self-managed i_version.
> + *
> + * Persistently storing the i_version
> + * ----------------------------------
> + * Queries of the i_version field are not gated on them hitting the backing
> + * store. It's always possible that the host could crash after allowing
> + * a query of the value but before it has made it to disk.
> + *
> + * To mitigate this problem, filesystems should always use
> + * inode_set_iversion_queried when loading an existing inode from disk. This
> + * ensures that the next attempted inode increment will result in the value
> + * changing.
> + *
> + * Storing the value to disk therefore does not count as a query, so those
> + * filesystems should use inode_peek_iversion to grab the value to be stored.
> + * There is no need to flag the value as having been queried in that case.
>   */
>  
> +/*
> + * We borrow the lowest bit in the i_version to use as a flag to tell whether
> + * it has been queried since we last incremented it. If it has, then we must
> + * increment it on the next change. After that, we can clear the flag and
> + * avoid incrementing it again until it has again been queried.
> + */
> +#define I_VERSION_QUERIED_SHIFT	(1)
> +#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
> +#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)
> +
>  /**
>   * inode_set_iversion_raw - set i_version to the specified raw value
>   * @inode: inode to set
> - * @new: new i_version value to set
> + * @val: new i_version value to set
>   *
> - * Set @inode's i_version field to @new. This function is for use by
> + * Set @inode's i_version field to @val. This function is for use by
>   * filesystems that self-manage the i_version.
>   *
>   * For example, the NFS client stores its NFSv4 change attribute in this way,
>   * and the AFS client stores the data_version from the server here.
>   */
>  static inline void
> -inode_set_iversion_raw(struct inode *inode, const u64 new)
> +inode_set_iversion_raw(struct inode *inode, const u64 val)
> +{
> +	atomic64_set(&inode->i_version, val);
> +}
> +
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
>  {
> -	inode->i_version = new;
> +	return atomic64_read(&inode->i_version);
>  }
>  
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> - * @new: new i_version value to set
> + * @val: new i_version value to set
>   *
> - * Set @inode's i_version field to @new. This function is for filesystems with
> - * a kernel-managed i_version.
> + * Set @inode's i_version field to @val. This function is for filesystems with
> + * a kernel-managed i_version, for initializing a newly-created inode from
> + * scratch.
>   *
> - * For now, this just does the same thing as the _raw variant.
> + * In this case, we do not set the QUERIED flag since we know that this value
> + * has never been queried.
>   */
>  static inline void
> -inode_set_iversion(struct inode *inode, const u64 new)
> +inode_set_iversion(struct inode *inode, const u64 val)
>  {
> -	inode_set_iversion_raw(inode, new);
> +	inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
>  }
>  
>  /**
> - * inode_set_iversion_queried - set i_version to a particular value and set
> - *                              flag to indicate that it has been viewed
> + * inode_set_iversion_queried - set i_version to a particular value as quereied
>   * @inode: inode to set
> - * @new: new i_version value to set
> + * @val: new i_version value to set
>   *
> - * When loading in an i_version value from a backing store, we typically don't
> - * know whether it was previously viewed before being stored or not. Thus, we
> - * must assume that it was, to ensure that any changes will result in the
> - * value changing.
> + * Set @inode's i_version field to @val, and flag it for increment on the next
> + * change.
>   *
> - * This function will set the inode's i_version, and possibly flag the value
> - * as if it has already been viewed at least once.
> + * Filesystems that persistently store the i_version on disk should use this
> + * when loading an existing inode from disk.
>   *
> - * For now, this just does what inode_set_iversion does.
> + * When loading in an i_version value from a backing store, we can't be certain
> + * that it wasn't previously viewed before being stored. Thus, we must assume
> + * that it was, to ensure that we don't end up handing out the same value for
> + * different versions of the same inode.
>   */
>  static inline void
> -inode_set_iversion_queried(struct inode *inode, const u64 new)
> +inode_set_iversion_queried(struct inode *inode, const u64 val)
>  {
> -	inode_set_iversion(inode, new);
> +	inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
> +				I_VERSION_QUERIED);
>  }
>  
>  /**
>   * inode_maybe_inc_iversion - increments i_version
>   * @inode: inode with the i_version that should be updated
> - * @force: increment the counter even if it's not necessary
> + * @force: increment the counter even if it's not necessary?
>   *
>   * Every time the inode is modified, the i_version field must be seen to have
>   * changed by any observer.
>   *
> - * In this implementation, we always increment it after taking the i_lock to
> - * ensure that we don't race with other incrementors.
> + * If "force" is set or the QUERIED flag is set, then ensure that we increment
> + * the value, and clear the queried flag.
>   *
> - * Returns true if counter was bumped, and false if it wasn't.
> + * In the common case where neither is set, then we can return "false" without
> + * updating i_version.
> + *
> + * If this function returns false, and no other metadata has changed, then we
> + * can avoid logging the metadata.
>   */
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> +	u64 cur, old, new;
> +
> +	/*
> +	 * The i_version field is not strictly ordered with any other inode
> +	 * information, but the legacy inode_inc_iversion code used a spinlock
> +	 * to serialize increments.
> +	 *
> +	 * Here, we add full memory barriers to ensure that any de-facto
> +	 * ordering with other info is preserved.
> +	 *
> +	 * This barrier pairs with the barrier in inode_query_iversion()
> +	 */
> +	smp_mb();
> +	cur = inode_peek_iversion_raw(inode);
> +	for (;;) {
> +		/* If flag is clear then we needn't do anything */
> +		if (!force && !(cur & I_VERSION_QUERIED))
> +			return false;
>  
> -	atomic64_inc(ivp);
> +		/* Since lowest bit is flag, add 2 to avoid it */
> +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> +
> +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> +		if (likely(old == cur))
> +			break;
> +		cur = old;
> +	}
>  	return true;
>  }
>  
> @@ -129,31 +216,12 @@ inode_inc_iversion(struct inode *inode)
>   * @inode: inode to check
>   *
>   * Returns whether the inode->i_version counter needs incrementing on the next
> - * change.
> - *
> - * For now, we assume that it always does.
> + * change. Just fetch the value and check the QUERIED flag.
>   */
>  static inline bool
>  inode_iversion_need_inc(struct inode *inode)
>  {
> -	return true;
> -}
> -
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> -	return inode->i_version;
> +	return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
>  }
>  
>  /**
> @@ -170,7 +238,7 @@ inode_peek_iversion_raw(const struct inode *inode)
>  static inline u64
>  inode_peek_iversion(const struct inode *inode)
>  {
> -	return inode_peek_iversion_raw(inode);
> +	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
>  }
>  
>  /**
> @@ -182,12 +250,35 @@ inode_peek_iversion(const struct inode *inode)
>   * that a later query of the i_version will result in a different value if
>   * anything has changed.
>   *
> - * This implementation just does a peek.
> + * In this implementation, we fetch the current value, set the QUERIED flag and
> + * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> + * that fails, we try again with the newly fetched value from the cmpxchg.
>   */
>  static inline u64
>  inode_query_iversion(struct inode *inode)
>  {
> -	return inode_peek_iversion(inode);
> +	u64 cur, old, new;
> +
> +	cur = inode_peek_iversion_raw(inode);
> +	for (;;) {
> +		/* If flag is already set, then no need to swap */
> +		if (cur & I_VERSION_QUERIED) {
> +			/*
> +			 * This barrier (and the implicit barrier in the
> +			 * cmpxchg below) pairs with the barrier in
> +			 * inode_maybe_inc_iversion().
> +			 */
> +			smp_mb();
> +			break;
> +		}
> +
> +		new = cur | I_VERSION_QUERIED;
> +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> +		if (likely(old == cur))
> +			break;
> +		cur = old;
> +	}
> +	return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  
>  /**
> @@ -196,11 +287,18 @@ inode_query_iversion(struct inode *inode)
>   * @old: old value to check against its i_version
>   *
>   * Compare an i_version counter with a previous one. Returns 0 if they are
> - * the same or non-zero if they are different.
> + * the same, a positive value if the one in the inode appears newer than @old,
> + * and a negative value if @old appears to be newer than the one in the
> + * inode.
> + *
> + * Note that we don't need to set the QUERIED flag in this case, as the value
> + * in the inode is not being recorded for later use.
>   */
> +
>  static inline s64
>  inode_cmp_iversion(const struct inode *inode, const u64 old)
>  {
> -	return (s64)inode_peek_iversion(inode) - (s64)old;
> +	return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
> +	       (s64)(old << I_VERSION_QUERIED_SHIFT);
>  }
>  #endif
> -- 
> 2.14.3
>
Krzysztof Kozlowski Jan. 7, 2018, 12:44 p.m. UTC | #2
On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <jlayton@kernel.org> wrote:
> From: Jeff Layton <jlayton@redhat.com>
>
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
>
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
>
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
>
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
>
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
>
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/fs.h       |   2 +-
>  include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 154 insertions(+), 56 deletions(-)
>

Hi,

On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
This commit popped up through bisect (log at the end). Systemd
timeouts on some device-specific services, including mounting ext4
/home:

[ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
[ TIME ] Timed out waiting for device dev-ttySAC2.device.
Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
Jan 07 13:29:38 [ TIME ] Timed out waiting for device
dev-disk-by\x2dlabel-home.device.
Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
/dev/disk/by-label/home.
Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
polling (1min 53s / no limit)

Kernel command line:
console=tty1 console=ttySAC2,115200n8
ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
nfsrootdebug root=/dev/nfs
nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend

/home is /dev/mmcblk1p2:
kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
tune2fs 1.43.7 (16-Oct-2017)
Filesystem volume name:   home
Last mounted on:          /home
Filesystem UUID:          3f9dbeba-2738-45d3-807e-c1b2e21128ed
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file
uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              1430800
Block count:              5717760
Reserved block count:     285888
Free blocks:              5467576
Free inodes:              1428301
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1022
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8176
Inode blocks per group:   511
Flex block group size:    16
Filesystem created:       Thu May 21 12:17:05 2015
Last mount time:          Thu Dec 21 13:31:26 2017
Last write time:          Thu Dec 21 13:31:26 2017
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Dec 21 13:31:25 2017
Check interval:           0 (<none>)
Lifetime writes:          126 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      42e17e23-86b2-4356-ad63-78aa51651d03
Journal backup:           inode blocks


Full dmesg log:
http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0

The regular boot from rootfs on SD card also fails - but without any
serial console logs (just "Starting kernel...") so the real cause is
unknown.

Any hints?

Best regards,
Krzysztof


bisect log:
# bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
specific files for 20180103
# good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
git bisect start 'next/master' 'next/stable'
# bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
remote-tracking branch 'crypto/master'
git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
# bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
remote-tracking branch 'hid/for-next'
git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
# good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
remote-tracking branch 'realtek/for-next'
git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
# good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
remote-tracking branch 'btrfs/next'
git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
# good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
remote-tracking branch 'xfs/for-next'
git bisect good 984c35877f36bee305e43a1c58176169854d85cf
# bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
remote-tracking branch 'printk/for-next'
git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
# good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
POLL... arch-independent
git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
# good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
'work.misc' into for-next
git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
# good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
i_version API
git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
# good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
pF/pf deprecation warning
git bisect good a94fe10fb114c169e7ddaecd8251521886409121
# good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
the inode in btrfs_update_time if something was changed
git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
# bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
remote-tracking branch 'iversion/iversion-next'
git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
# bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
inode->i_version more efficiently
git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
# first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
handle inode->i_version more efficiently
Jeff Layton Jan. 8, 2018, 12:56 p.m. UTC | #3
On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > From: Jeff Layton <jlayton@redhat.com>
> > 
> > Since i_version is mostly treated as an opaque value, we can exploit that
> > fact to avoid incrementing it when no one is watching. With that change,
> > we can avoid incrementing the counter on writes, unless someone has
> > queried for it since it was last incremented. If the a/c/mtime don't
> > change, and the i_version hasn't changed, then there's no need to dirty
> > the inode metadata on a write.
> > 
> > Convert the i_version counter to an atomic64_t, and use the lowest order
> > bit to hold a flag that will tell whether anyone has queried the value
> > since it was last incremented.
> > 
> > When we go to maybe increment it, we fetch the value and check the flag
> > bit.  If it's clear then we don't need to do anything if the update
> > isn't being forced.
> > 
> > If we do need to update, then we increment the counter by 2, and clear
> > the flag bit, and then use a CAS op to swap it into place. If that
> > works, we return true. If it doesn't then do it again with the value
> > that we fetch from the CAS operation.
> > 
> > On the query side, if the flag is already set, then we just shift the
> > value down by 1 bit and return it. Otherwise, we set the flag in our
> > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > changed. If it has, then we use the value from the cmpxchg as the new
> > "old" value and try again.
> > 
> > This method allows us to avoid incrementing the counter on writes (and
> > dirtying the metadata) under typical workloads. We only need to increment
> > if it has been queried since it was last changed.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  include/linux/fs.h       |   2 +-
> >  include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> >  2 files changed, 154 insertions(+), 56 deletions(-)
> > 
> 
> Hi,
> 
> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> This commit popped up through bisect (log at the end). Systemd
> timeouts on some device-specific services, including mounting ext4
> /home:
> 
> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
> [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> dev-disk-by\x2dlabel-home.device.
> Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> /dev/disk/by-label/home.
> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
> polling (1min 53s / no limit)
> 
> Kernel command line:
> console=tty1 console=ttySAC2,115200n8
> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> nfsrootdebug root=/dev/nfs
> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> 
> /home is /dev/mmcblk1p2:
> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> tune2fs 1.43.7 (16-Oct-2017)
> Filesystem volume name:   home
> Last mounted on:          /home
> Filesystem UUID:          3f9dbeba-2738-45d3-807e-c1b2e21128ed
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index
> filetype needs_recovery extent flex_bg sparse_super large_file
> uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              1430800
> Block count:              5717760
> Reserved block count:     285888
> Free blocks:              5467576
> Free inodes:              1428301
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Reserved GDT blocks:      1022
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         8176
> Inode blocks per group:   511
> Flex block group size:    16
> Filesystem created:       Thu May 21 12:17:05 2015
> Last mount time:          Thu Dec 21 13:31:26 2017
> Last write time:          Thu Dec 21 13:31:26 2017
> Mount count:              1
> Maximum mount count:      -1
> Last checked:             Thu Dec 21 13:31:25 2017
> Check interval:           0 (<none>)
> Lifetime writes:          126 GB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:           256
> Required extra isize:     28
> Desired extra isize:      28
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed:      42e17e23-86b2-4356-ad63-78aa51651d03
> Journal backup:           inode blocks
> 
> 
> Full dmesg log:
> http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
> 
> The regular boot from rootfs on SD card also fails - but without any
> serial console logs (just "Starting kernel...") so the real cause is
> unknown.
> 
> Any hints?
> 
> Best regards,
> Krzysztof
> 
> 
> bisect log:
> # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
> specific files for 20180103
> # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
> git bisect start 'next/master' 'next/stable'
> # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
> remote-tracking branch 'crypto/master'
> git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
> # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
> remote-tracking branch 'hid/for-next'
> git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
> # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
> remote-tracking branch 'realtek/for-next'
> git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
> # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
> remote-tracking branch 'btrfs/next'
> git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
> # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
> remote-tracking branch 'xfs/for-next'
> git bisect good 984c35877f36bee305e43a1c58176169854d85cf
> # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
> remote-tracking branch 'printk/for-next'
> git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
> # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
> POLL... arch-independent
> git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
> # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
> 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
> 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
> 'work.misc' into for-next
> git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
> # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
> i_version API
> git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
> # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
> pF/pf deprecation warning
> git bisect good a94fe10fb114c169e7ddaecd8251521886409121
> # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
> the inode in btrfs_update_time if something was changed
> git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
> # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
> remote-tracking branch 'iversion/iversion-next'
> git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
> # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
> inode->i_version more efficiently
> git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
> # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
> handle inode->i_version more efficiently

That's really strange. I'm afraid I have no idea what could be going on.

With NFS, we really just treat i_version as an opaque value, so I'm not
sure how this patch in particular would affect anything there. We _do_
increment it if you have a write delegation in some cases, but not many
servers hand those out.

ext4 will only touch the i_version field if you mount it with '-o
iversion'. Are you doing that here?

Have you run the bisect more than once? Is this maybe an intermittent
problem, and the bisect has landed on the wrong commit?

Thanks,
Krzysztof Kozlowski Jan. 8, 2018, 1:21 p.m. UTC | #4
On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton <jlayton@kernel.org> wrote:
> On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
>> On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <jlayton@kernel.org> wrote:
>> > From: Jeff Layton <jlayton@redhat.com>
>> >
>> > Since i_version is mostly treated as an opaque value, we can exploit that
>> > fact to avoid incrementing it when no one is watching. With that change,
>> > we can avoid incrementing the counter on writes, unless someone has
>> > queried for it since it was last incremented. If the a/c/mtime don't
>> > change, and the i_version hasn't changed, then there's no need to dirty
>> > the inode metadata on a write.
>> >
>> > Convert the i_version counter to an atomic64_t, and use the lowest order
>> > bit to hold a flag that will tell whether anyone has queried the value
>> > since it was last incremented.
>> >
>> > When we go to maybe increment it, we fetch the value and check the flag
>> > bit.  If it's clear then we don't need to do anything if the update
>> > isn't being forced.
>> >
>> > If we do need to update, then we increment the counter by 2, and clear
>> > the flag bit, and then use a CAS op to swap it into place. If that
>> > works, we return true. If it doesn't then do it again with the value
>> > that we fetch from the CAS operation.
>> >
>> > On the query side, if the flag is already set, then we just shift the
>> > value down by 1 bit and return it. Otherwise, we set the flag in our
>> > on-stack value and again use cmpxchg to swap it into place if it hasn't
>> > changed. If it has, then we use the value from the cmpxchg as the new
>> > "old" value and try again.
>> >
>> > This method allows us to avoid incrementing the counter on writes (and
>> > dirtying the metadata) under typical workloads. We only need to increment
>> > if it has been queried since it was last changed.
>> >
>> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > ---
>> >  include/linux/fs.h       |   2 +-
>> >  include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
>> >  2 files changed, 154 insertions(+), 56 deletions(-)
>> >
>>
>> Hi,
>>
>> On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
>> This commit popped up through bisect (log at the end). Systemd
>> timeouts on some device-specific services, including mounting ext4
>> /home:
>>
>> [ *** ] (1 of 4) A start job is running for…ress polling (1min 41s / no limit)
>> [ TIME ] Timed out waiting for device dev-ttySAC2.device.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
>> Jan 07 13:29:38 [ TIME ] Timed out waiting for device
>> dev-disk-by\x2dlabel-home.device.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
>> Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
>> /dev/disk/by-label/home.
>> Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running for…ress
>> polling (1min 53s / no limit)
>>
>> Kernel command line:
>> console=tty1 console=ttySAC2,115200n8
>> ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
>> nfsrootdebug root=/dev/nfs
>> nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
>> smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
>>
>> /home is /dev/mmcblk1p2:
>> kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
>> tune2fs 1.43.7 (16-Oct-2017)
>> Filesystem volume name:   home
>> Last mounted on:          /home
>> Filesystem UUID:          3f9dbeba-2738-45d3-807e-c1b2e21128ed
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery extent flex_bg sparse_super large_file
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags:         signed_directory_hash
>> Default mount options:    user_xattr acl
>> Filesystem state:         clean
>> Errors behavior:          Continue
>> Filesystem OS type:       Linux
>> Inode count:              1430800
>> Block count:              5717760
>> Reserved block count:     285888
>> Free blocks:              5467576
>> Free inodes:              1428301
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Reserved GDT blocks:      1022
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         8176
>> Inode blocks per group:   511
>> Flex block group size:    16
>> Filesystem created:       Thu May 21 12:17:05 2015
>> Last mount time:          Thu Dec 21 13:31:26 2017
>> Last write time:          Thu Dec 21 13:31:26 2017
>> Mount count:              1
>> Maximum mount count:      -1
>> Last checked:             Thu Dec 21 13:31:25 2017
>> Check interval:           0 (<none>)
>> Lifetime writes:          126 GB
>> Reserved blocks uid:      0 (user root)
>> Reserved blocks gid:      0 (group root)
>> First inode:              11
>> Inode size:           256
>> Required extra isize:     28
>> Desired extra isize:      28
>> Journal inode:            8
>> Default directory hash:   half_md4
>> Directory Hash Seed:      42e17e23-86b2-4356-ad63-78aa51651d03
>> Journal backup:           inode blocks
>>
>>
>> Full dmesg log:
>> http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
>>
>> The regular boot from rootfs on SD card also fails - but without any
>> serial console logs (just "Starting kernel...") so the real cause is
>> unknown.
>>
>> Any hints?
>>
>> Best regards,
>> Krzysztof
>>
>>
>> bisect log:
>> # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
>> specific files for 20180103
>> # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
>> git bisect start 'next/master' 'next/stable'
>> # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
>> remote-tracking branch 'crypto/master'
>> git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
>> # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
>> remote-tracking branch 'hid/for-next'
>> git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
>> # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
>> remote-tracking branch 'realtek/for-next'
>> git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
>> # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
>> remote-tracking branch 'btrfs/next'
>> git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
>> # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
>> remote-tracking branch 'xfs/for-next'
>> git bisect good 984c35877f36bee305e43a1c58176169854d85cf
>> # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
>> remote-tracking branch 'printk/for-next'
>> git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
>> # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
>> POLL... arch-independent
>> git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
>> # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
>> 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
>> 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
>> 'work.misc' into for-next
>> git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
>> # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
>> i_version API
>> git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
>> # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
>> pF/pf deprecation warning
>> git bisect good a94fe10fb114c169e7ddaecd8251521886409121
>> # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
>> the inode in btrfs_update_time if something was changed
>> git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
>> # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
>> remote-tracking branch 'iversion/iversion-next'
>> git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
>> # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
>> inode->i_version more efficiently
>> git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
>> # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
>> handle inode->i_version more efficiently
>
> That's really strange. I'm afraid I have no idea what could be going on.
>
> With NFS, we really just treat i_version as an opaque value, so I'm not
> sure how this patch in particular would affect anything there. We _do_
> increment it if you have a write delegation in some cases, but not many
> servers hand those out.

About the NFS server, Arch Linux on Raspberry Pi (so 32-bit, ARMv6):
Linux pi 4.9.70-1-ARCH #1 SMP Mon Dec 18 19:38:00 UTC 2017 armv6l GNU/Linux

The /etc/nfs.conf is default except:
[nfsd]
vers2=n
vers3=n

The client logs for nfsroot mounts are:
Jan 08 14:07:25 :: running hook [net_nfs4]
Jan 08 14:07:25 IP-Config: eth0 hardware address ba:17:70:7e:87:d1 mtu 1500
Jan 08 14:07:25 IP-Config: eth0 guessed broadcast address 192.168.1.255
Jan 08 14:07:25 IP-Config: eth0 complete (from 192.168.1.10):
Jan 08 14:07:25 address: 192.168.1.11 broadcast: 192.168.1.255
netmask: 255.255.255.0
Jan 08 14:07:25 gateway: 192.168.1.1 dns0 : 0.0.0.0 dns1 : 0.0.0.0
Jan 08 14:07:25 rootserver: 192.168.1.10 rootpath:
Jan 08 14:07:25 filename :
Jan 08 14:07:25 NFS-Mount: 192.168.1.10:/srv/nfs/odroidxu3
Jan 08 14:07:25 Waiting 10 seconds for device /dev/nfs ...
Jan 08 14:07:36 ERROR: device '/dev/nfs' not found. Skipping fsck.
Jan 08 14:07:36 Mount cmd:
Jan 08 14:07:36 /opt/tools/buildbot/arch-arm-bin/mount.nfs4 -o
vers=4,nolock 192.168.1.10:/srv/nfs/odroidxu3 /new_root

Only root (/) is froom NFS. The /home comes from sdcard (/dev/mmcblk1).

> ext4 will only touch the i_version field if you mount it with '-o
> iversion'. Are you doing that here?

The /home is mounted by systemd from /etc/fstab:
      LABEL=home /home ext4 defaults 0 2

>
> Have you run the bisect more than once? Is this maybe an intermittent
> problem, and the bisect has landed on the wrong commit?

I just run tests on commits around again (but not full bisect) on
current next (next-20180108):
fbf97ece47d66d22 - works
3da7bcdd695bae43 ("fs: handle inode->i_version more efficiently") -
fails: http://www.krzk.eu/#/builders/1/builds/1267


Best regards,
Krzysztof
Matthew Wilcox Jan. 8, 2018, 1:30 p.m. UTC | #5
On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote:
> +	cur = inode_peek_iversion_raw(inode);
> +	for (;;) {
> +		/* If flag is clear then we needn't do anything */
> +		if (!force && !(cur & I_VERSION_QUERIED))
> +			return false;
> +		/* Since lowest bit is flag, add 2 to avoid it */
> +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;

Isn't this an extraordinarily complicated way of spelling:

		new = cur + 1;

We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding
two is going to be the same as adding 1 ... right?
Jeff Layton Jan. 8, 2018, 1:46 p.m. UTC | #6
On Mon, 2018-01-08 at 05:30 -0800, Matthew Wilcox wrote:
> On Fri, Dec 22, 2017 at 07:05:56AM -0500, Jeff Layton wrote:
> > +	cur = inode_peek_iversion_raw(inode);
> > +	for (;;) {
> > +		/* If flag is clear then we needn't do anything */
> > +		if (!force && !(cur & I_VERSION_QUERIED))
> > +			return false;
> > +		/* Since lowest bit is flag, add 2 to avoid it */
> > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> 
> Isn't this an extraordinarily complicated way of spelling:
> 
> 		new = cur + 1;
> 
> We know 'cur' has I_VERSION_QUERIED set, so clearing that bit and adding
> two is going to be the same as adding 1 ... right?
> 

It would be, but if "force" is true, then I_VERSION_QUERIED may not be
set.
diff mbox

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76382c24e9d0..6804d075933e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@  struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	u64			i_version;
+	atomic64_t		i_version;
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e08c634779df..cef242e54489 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -5,6 +5,8 @@ 
 #include <linux/fs.h>
 
 /*
+ * The inode->i_version field:
+ * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
  * appear different to observers if there was a change to the inode's data or
@@ -27,86 +29,171 @@ 
  * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
  * We consider these sorts of filesystems to have a kernel-managed i_version.
  *
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it as well.
+ *
+ * With this implementation, the value should always appear to observers to
+ * increase over time if the file has changed. It's recommended to use
+ * inode_cmp_iversion() helper to compare values.
+ *
  * Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
+ * a server-provided value for the most part. For that reason, those
  * filesystems do not set SB_I_VERSION. These filesystems are considered to
  * have a self-managed i_version.
+ *
+ * Persistently storing the i_version
+ * ----------------------------------
+ * Queries of the i_version field are not gated on them hitting the backing
+ * store. It's always possible that the host could crash after allowing
+ * a query of the value but before it has made it to disk.
+ *
+ * To mitigate this problem, filesystems should always use
+ * inode_set_iversion_queried when loading an existing inode from disk. This
+ * ensures that the next attempted inode increment will result in the value
+ * changing.
+ *
+ * Storing the value to disk therefore does not count as a query, so those
+ * filesystems should use inode_peek_iversion to grab the value to be stored.
+ * There is no need to flag the value as having been queried in that case.
  */
 
+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
+ */
+#define I_VERSION_QUERIED_SHIFT	(1)
+#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
+#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)
+
 /**
  * inode_set_iversion_raw - set i_version to the specified raw value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for use by
+ * Set @inode's i_version field to @val. This function is for use by
  * filesystems that self-manage the i_version.
  *
  * For example, the NFS client stores its NFSv4 change attribute in this way,
  * and the AFS client stores the data_version from the server here.
  */
 static inline void
-inode_set_iversion_raw(struct inode *inode, const u64 new)
+inode_set_iversion_raw(struct inode *inode, const u64 val)
+{
+	atomic64_set(&inode->i_version, val);
+}
+
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
 {
-	inode->i_version = new;
+	return atomic64_read(&inode->i_version);
 }
 
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for filesystems with
- * a kernel-managed i_version.
+ * Set @inode's i_version field to @val. This function is for filesystems with
+ * a kernel-managed i_version, for initializing a newly-created inode from
+ * scratch.
  *
- * For now, this just does the same thing as the _raw variant.
+ * In this case, we do not set the QUERIED flag since we know that this value
+ * has never been queried.
  */
 static inline void
-inode_set_iversion(struct inode *inode, const u64 new)
+inode_set_iversion(struct inode *inode, const u64 val)
 {
-	inode_set_iversion_raw(inode, new);
+	inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
 }
 
 /**
- * inode_set_iversion_queried - set i_version to a particular value and set
- *                              flag to indicate that it has been viewed
+ * inode_set_iversion_queried - set i_version to a particular value as quereied
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * When loading in an i_version value from a backing store, we typically don't
- * know whether it was previously viewed before being stored or not. Thus, we
- * must assume that it was, to ensure that any changes will result in the
- * value changing.
+ * Set @inode's i_version field to @val, and flag it for increment on the next
+ * change.
  *
- * This function will set the inode's i_version, and possibly flag the value
- * as if it has already been viewed at least once.
+ * Filesystems that persistently store the i_version on disk should use this
+ * when loading an existing inode from disk.
  *
- * For now, this just does what inode_set_iversion does.
+ * When loading in an i_version value from a backing store, we can't be certain
+ * that it wasn't previously viewed before being stored. Thus, we must assume
+ * that it was, to ensure that we don't end up handing out the same value for
+ * different versions of the same inode.
  */
 static inline void
-inode_set_iversion_queried(struct inode *inode, const u64 new)
+inode_set_iversion_queried(struct inode *inode, const u64 val)
 {
-	inode_set_iversion(inode, new);
+	inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
+				I_VERSION_QUERIED);
 }
 
 /**
  * inode_maybe_inc_iversion - increments i_version
  * @inode: inode with the i_version that should be updated
- * @force: increment the counter even if it's not necessary
+ * @force: increment the counter even if it's not necessary?
  *
  * Every time the inode is modified, the i_version field must be seen to have
  * changed by any observer.
  *
- * In this implementation, we always increment it after taking the i_lock to
- * ensure that we don't race with other incrementors.
+ * If "force" is set or the QUERIED flag is set, then ensure that we increment
+ * the value, and clear the queried flag.
  *
- * Returns true if counter was bumped, and false if it wasn't.
+ * In the common case where neither is set, then we can return "false" without
+ * updating i_version.
+ *
+ * If this function returns false, and no other metadata has changed, then we
+ * can avoid logging the metadata.
  */
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+	u64 cur, old, new;
+
+	/*
+	 * The i_version field is not strictly ordered with any other inode
+	 * information, but the legacy inode_inc_iversion code used a spinlock
+	 * to serialize increments.
+	 *
+	 * Here, we add full memory barriers to ensure that any de-facto
+	 * ordering with other info is preserved.
+	 *
+	 * This barrier pairs with the barrier in inode_query_iversion()
+	 */
+	smp_mb();
+	cur = inode_peek_iversion_raw(inode);
+	for (;;) {
+		/* If flag is clear then we needn't do anything */
+		if (!force && !(cur & I_VERSION_QUERIED))
+			return false;
 
-	atomic64_inc(ivp);
+		/* Since lowest bit is flag, add 2 to avoid it */
+		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
+
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (likely(old == cur))
+			break;
+		cur = old;
+	}
 	return true;
 }
 
@@ -129,31 +216,12 @@  inode_inc_iversion(struct inode *inode)
  * @inode: inode to check
  *
  * Returns whether the inode->i_version counter needs incrementing on the next
- * change.
- *
- * For now, we assume that it always does.
+ * change. Just fetch the value and check the QUERIED flag.
  */
 static inline bool
 inode_iversion_need_inc(struct inode *inode)
 {
-	return true;
-}
-
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
-	return inode->i_version;
+	return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
 }
 
 /**
@@ -170,7 +238,7 @@  inode_peek_iversion_raw(const struct inode *inode)
 static inline u64
 inode_peek_iversion(const struct inode *inode)
 {
-	return inode_peek_iversion_raw(inode);
+	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -182,12 +250,35 @@  inode_peek_iversion(const struct inode *inode)
  * that a later query of the i_version will result in a different value if
  * anything has changed.
  *
- * This implementation just does a peek.
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
  */
 static inline u64
 inode_query_iversion(struct inode *inode)
 {
-	return inode_peek_iversion(inode);
+	u64 cur, old, new;
+
+	cur = inode_peek_iversion_raw(inode);
+	for (;;) {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED) {
+			/*
+			 * This barrier (and the implicit barrier in the
+			 * cmpxchg below) pairs with the barrier in
+			 * inode_maybe_inc_iversion().
+			 */
+			smp_mb();
+			break;
+		}
+
+		new = cur | I_VERSION_QUERIED;
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (likely(old == cur))
+			break;
+		cur = old;
+	}
+	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -196,11 +287,18 @@  inode_query_iversion(struct inode *inode)
  * @old: old value to check against its i_version
  *
  * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same or non-zero if they are different.
+ * the same, a positive value if the one in the inode appears newer than @old,
+ * and a negative value if @old appears to be newer than the one in the
+ * inode.
+ *
+ * Note that we don't need to set the QUERIED flag in this case, as the value
+ * in the inode is not being recorded for later use.
  */
+
 static inline s64
 inode_cmp_iversion(const struct inode *inode, const u64 old)
 {
-	return (s64)inode_peek_iversion(inode) - (s64)old;
+	return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
+	       (s64)(old << I_VERSION_QUERIED_SHIFT);
 }
 #endif