Message ID | 20171218151156.14565-20-jlayton@kernel.org (mailing list archive) |
---|---|
State | New, archived |
On Mon 18-12-17 10:11:56, Jeff Layton wrote:
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> +	u64 cur, old, new;
>
> -	atomic64_inc(ivp);
> +	cur = (u64)atomic64_read(&inode->i_version);
> +	for (;;) {
> +		/* If flag is clear then we needn't do anything */
> +		if (!force && !(cur & I_VERSION_QUERIED))
> +			return false;

The fast path here misses any memory barrier. Thus it seems this query
could in theory be reordered before any store that happened to modify the
inode? Or maybe we could race and miss the fact that this i_version has in
fact already been queried? But maybe there's some higher-level locking that
makes sure this is all a non-issue... But in that case it would deserve a
comment, I guess.

> +
> +		/* Since lowest bit is flag, add 2 to avoid it */
> +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> +
> +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> +		if (likely(old == cur))
> +			break;
> +		cur = old;
> +	}
>  	return true;
>  }

...

>  static inline u64
>  inode_query_iversion(struct inode *inode)
>  {
> -	return inode_peek_iversion(inode);
> +	u64 cur, old, new;
> +
> +	cur = atomic64_read(&inode->i_version);
> +	for (;;) {
> +		/* If flag is already set, then no need to swap */
> +		if (cur & I_VERSION_QUERIED)
> +			break;
> +
> +		new = cur | I_VERSION_QUERIED;
> +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> +		if (old == cur)
> +			break;
> +		cur = old;
> +	}

Why not just use atomic64_or() here?

								Honza
On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> >  static inline bool
> >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> >  {
> > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > +	u64 cur, old, new;
> >
> > -	atomic64_inc(ivp);
> > +	cur = (u64)atomic64_read(&inode->i_version);
> > +	for (;;) {
> > +		/* If flag is clear then we needn't do anything */
> > +		if (!force && !(cur & I_VERSION_QUERIED))
> > +			return false;
>
> The fast path here misses any memory barrier. Thus it seems this query
> could in theory be reordered before any store that happened to modify the
> inode? Or maybe we could race and miss the fact that this i_version has in
> fact already been queried? But maybe there's some higher-level locking that
> makes sure this is all a non-issue... But in that case it would deserve a
> comment, I guess.
>

There's no higher-level locking. Getting locking out of this codepath is
a good thing, IMO. The larger question here is whether we really care
about ordering this with anything else.

The i_version, as implemented today, is not ordered with actual changes
to the inode. We only take the i_lock today when modifying it, not when
querying it. It's possible today that you could see the results of a
change and then do a fetch of the i_version that doesn't show an
increment vs. a previous change.

It'd be nice if this were atomic with the actual changes that it
represents, but I think that would be prohibitively expensive. That may
be something we need to address. I'm not sure we really want to do it as
part of this patchset, though.

> > +
> > +		/* Since lowest bit is flag, add 2 to avoid it */
> > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > +
> > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > +		if (likely(old == cur))
> > +			break;
> > +		cur = old;
> > +	}
> >  	return true;
> >  }
> >
> > ...
>
> >  static inline u64
> >  inode_query_iversion(struct inode *inode)
> >  {
> > -	return inode_peek_iversion(inode);
> > +	u64 cur, old, new;
> > +
> > +	cur = atomic64_read(&inode->i_version);
> > +	for (;;) {
> > +		/* If flag is already set, then no need to swap */
> > +		if (cur & I_VERSION_QUERIED)
> > +			break;
> > +
> > +		new = cur | I_VERSION_QUERIED;
> > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > +		if (old == cur)
> > +			break;
> > +		cur = old;
> > +	}
>
> Why not just use atomic64_or() here?
>

If the cmpxchg fails, then either:

1) it was incremented
2) someone flagged it QUERIED

If an increment happened, then we don't need to flag it as QUERIED if
we're returning an older value. If we use atomic64_or, then we can't
tell whether an increment happened, so we'd end up potentially flagging
it more than necessary.

In principle, either outcome is technically OK and we don't have to loop
if the cmpxchg doesn't work. That said, if we think there might be a
later i_version available, then I think we probably want to try to query
it again so we can return as late a one as possible.
On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > +	u64 cur, old, new;
> > >
> > > -	atomic64_inc(ivp);
> > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > +	for (;;) {
> > > +		/* If flag is clear then we needn't do anything */
> > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > +			return false;
> >
> > The fast path here misses any memory barrier. Thus it seems this query
> > could in theory be reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that this i_version has
> > in fact already been queried? But maybe there's some higher-level locking
> > that makes sure this is all a non-issue... But in that case it would
> > deserve a comment, I guess.
> >
>
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing, IMO. The larger question here is whether we really care
> about ordering this with anything else.
>
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.
>
> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset, though.

I don't think we need full atomicity, but we *do* need the i_version
increment to be seen as ordered after the inode change. That should be
relatively inexpensive, yes?

Otherwise an inode change could be overlooked indefinitely, because a
client could cache the old contents with the new i_version value and
never see the i_version change again as long as the inode isn't touched
again.

(If the client instead caches new data with the old inode version, there
may be an unnecessary cache invalidation the next time the client queries
the change attribute. That's a performance bug, but not really a
correctness problem, at least for clients depending only on close-to-open
semantics: they'll only depend on seeing the new change attribute after
the change has been flushed to disk. The performance problem should be
mitigated by the race being very rare.)

--b.
On Mon 18-12-17 12:22:20, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > +	u64 cur, old, new;
> > >
> > > -	atomic64_inc(ivp);
> > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > +	for (;;) {
> > > +		/* If flag is clear then we needn't do anything */
> > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > +			return false;
> >
> > The fast path here misses any memory barrier. Thus it seems this query
> > could in theory be reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that this i_version has
> > in fact already been queried? But maybe there's some higher-level locking
> > that makes sure this is all a non-issue... But in that case it would
> > deserve a comment, I guess.
> >
>
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing, IMO. The larger question here is whether we really care
> about ordering this with anything else.
>
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.

Yeah, so I don't suggest that you should fix unrelated issues, but the
original i_lock protection did actually provide memory barriers (although
semi-permeable ones, in practice they are very often enough), and your
patch removing those could have turned a theoretical issue into a
practical problem. So at least preserving the original acquire-release
semantics of i_version handling would IMHO be good.

> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset, though.
>
> > > +
> > > +		/* Since lowest bit is flag, add 2 to avoid it */
> > > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > > +
> > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > +		if (likely(old == cur))
> > > +			break;
> > > +		cur = old;
> > > +	}
> > >  	return true;
> > >  }
> > >
> > > ...
> >
> > >  static inline u64
> > >  inode_query_iversion(struct inode *inode)
> > >  {
> > > -	return inode_peek_iversion(inode);
> > > +	u64 cur, old, new;
> > > +
> > > +	cur = atomic64_read(&inode->i_version);
> > > +	for (;;) {
> > > +		/* If flag is already set, then no need to swap */
> > > +		if (cur & I_VERSION_QUERIED)
> > > +			break;
> > > +
> > > +		new = cur | I_VERSION_QUERIED;
> > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > +		if (old == cur)
> > > +			break;
> > > +		cur = old;
> > > +	}
> >
> > Why not just use atomic64_or() here?
> >
>
> If the cmpxchg fails, then either:
>
> 1) it was incremented
> 2) someone flagged it QUERIED
>
> If an increment happened, then we don't need to flag it as QUERIED if
> we're returning an older value. If we use atomic64_or, then we can't
> tell whether an increment happened, so we'd end up potentially flagging
> it more than necessary.
>
> In principle, either outcome is technically OK and we don't have to loop
> if the cmpxchg doesn't work. That said, if we think there might be a
> later i_version available, then I think we probably want to try to query
> it again so we can return as late a one as possible.

OK, makes sense. I'm just a bit wary of cmpxchg loops as they tend to
behave pretty badly in contended cases, but I guess i_version won't be
hammered *that* hard.

								Honza
On Tue, 2017-12-19 at 10:29 +0100, Jan Kara wrote:
> On Mon 18-12-17 12:22:20, Jeff Layton wrote:
> > On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > > >  static inline bool
> > > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > > >  {
> > > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > > +	u64 cur, old, new;
> > > >
> > > > -	atomic64_inc(ivp);
> > > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is clear then we needn't do anything */
> > > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > > +			return false;
> > >
> > > The fast path here misses any memory barrier. Thus it seems this query
> > > could in theory be reordered before any store that happened to modify
> > > the inode? Or maybe we could race and miss the fact that this i_version
> > > has in fact already been queried? But maybe there's some higher-level
> > > locking that makes sure this is all a non-issue... But in that case it
> > > would deserve a comment, I guess.
> > >
> >
> > There's no higher-level locking. Getting locking out of this codepath is
> > a good thing, IMO. The larger question here is whether we really care
> > about ordering this with anything else.
> >
> > The i_version, as implemented today, is not ordered with actual changes
> > to the inode. We only take the i_lock today when modifying it, not when
> > querying it. It's possible today that you could see the results of a
> > change and then do a fetch of the i_version that doesn't show an
> > increment vs. a previous change.
>
> Yeah, so I don't suggest that you should fix unrelated issues, but the
> original i_lock protection did actually provide memory barriers (although
> semi-permeable ones, in practice they are very often enough), and your
> patch removing those could have turned a theoretical issue into a
> practical problem. So at least preserving the original acquire-release
> semantics of i_version handling would IMHO be good.
>

Agreed. I've no objection to memory barriers here and I'm looking at
that; I just need to go over Dave's comments and memory-barriers.txt
(again!) to make sure I get them right.

> > It'd be nice if this were atomic with the actual changes that it
> > represents, but I think that would be prohibitively expensive. That may
> > be something we need to address. I'm not sure we really want to do it as
> > part of this patchset, though.
> >
> > > > +
> > > > +		/* Since lowest bit is flag, add 2 to avoid it */
> > > > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > > > +
> > > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > > +		if (likely(old == cur))
> > > > +			break;
> > > > +		cur = old;
> > > > +	}
> > > >  	return true;
> > > >  }
> > > >
> > > > ...
> > >
> > > >  static inline u64
> > > >  inode_query_iversion(struct inode *inode)
> > > >  {
> > > > -	return inode_peek_iversion(inode);
> > > > +	u64 cur, old, new;
> > > > +
> > > > +	cur = atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is already set, then no need to swap */
> > > > +		if (cur & I_VERSION_QUERIED)
> > > > +			break;
> > > > +
> > > > +		new = cur | I_VERSION_QUERIED;
> > > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > > +		if (old == cur)
> > > > +			break;
> > > > +		cur = old;
> > > > +	}
> > >
> > > Why not just use atomic64_or() here?
> > >
> >
> > If the cmpxchg fails, then either:
> >
> > 1) it was incremented
> > 2) someone flagged it QUERIED
> >
> > If an increment happened, then we don't need to flag it as QUERIED if
> > we're returning an older value. If we use atomic64_or, then we can't
> > tell whether an increment happened, so we'd end up potentially flagging
> > it more than necessary.
> >
> > In principle, either outcome is technically OK and we don't have to loop
> > if the cmpxchg doesn't work. That said, if we think there might be a
> > later i_version available, then I think we probably want to try to query
> > it again so we can return as late a one as possible.
>
> OK, makes sense. I'm just a bit wary of cmpxchg loops as they tend to
> behave pretty badly in contended cases, but I guess i_version won't be
> hammered *that* hard.
>

That's the principle I'm operating under here, and I think it's valid
for almost all workloads. Incrementing the i_version on parallel writes
should be mostly uncontended now, whereas they all had to serialize on
the i_lock before.

The pessimal case here, I think, is parallel increments and queries. We
may see that sort of workload under knfsd, but I'm fine with giving
knfsd a small performance hit to help performance on other workloads.

While we're on the subject of looping here, should I add a cpu_relax()
into these loops?

Thanks for the review so far!
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76382c24e9d0..6804d075933e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@ struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	u64			i_version;
+	atomic64_t		i_version;
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e08c634779df..a9fbf99709df 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -5,6 +5,8 @@
 #include <linux/fs.h>
 
 /*
+ * The inode->i_version field:
+ * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
  * appear different to observers if there was a change to the inode's data or
@@ -27,86 +29,143 @@
  * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
  * We consider these sorts of filesystems to have a kernel-managed i_version.
  *
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it as well.
+ *
+ * With this implementation, the value should always appear to observers to
+ * increase over time if the file has changed. It's recommended to use
+ * inode_cmp_iversion() helper to compare values.
+ *
  * Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
+ * a server-provided value for the most part. For that reason, those
  * filesystems do not set SB_I_VERSION. These filesystems are considered to
  * have a self-managed i_version.
+ *
+ * Persistently storing the i_version
+ * ----------------------------------
+ * Queries of the i_version field are not gated on them hitting the backing
+ * store. It's always possible that the host could crash after allowing
+ * a query of the value but before it has made it to disk.
+ *
+ * To mitigate this problem, filesystems should always use
+ * inode_set_iversion_queried when loading an existing inode from disk. This
+ * ensures that the next attempted inode increment will result in the value
+ * changing.
+ *
+ * Storing the value to disk therefore does not count as a query, so those
+ * filesystems should use inode_peek_iversion to grab the value to be stored.
+ * There is no need to flag the value as having been queried in that case.
  */
 
+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
+ */
+#define I_VERSION_QUERIED_SHIFT	(1)
+#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
+#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)
+
 /**
  * inode_set_iversion_raw - set i_version to the specified raw value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for use by
+ * Set @inode's i_version field to @val. This function is for use by
  * filesystems that self-manage the i_version.
  *
  * For example, the NFS client stores its NFSv4 change attribute in this way,
  * and the AFS client stores the data_version from the server here.
  */
 static inline void
-inode_set_iversion_raw(struct inode *inode, const u64 new)
+inode_set_iversion_raw(struct inode *inode, const u64 val)
 {
-	inode->i_version = new;
+	atomic64_set(&inode->i_version, val);
 }
 
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for filesystems with
- * a kernel-managed i_version.
+ * Set @inode's i_version field to @val. This function is for filesystems with
+ * a kernel-managed i_version, for initializing a newly-created inode from
+ * scratch.
  *
- * For now, this just does the same thing as the _raw variant.
+ * In this case, we do not set the QUERIED flag since we know that this value
+ * has never been queried.
  */
 static inline void
-inode_set_iversion(struct inode *inode, const u64 new)
+inode_set_iversion(struct inode *inode, const u64 val)
 {
-	inode_set_iversion_raw(inode, new);
+	inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
 }
 
 /**
- * inode_set_iversion_queried - set i_version to a particular value and set
- *                              flag to indicate that it has been viewed
+ * inode_set_iversion_queried - set i_version to a particular value as quereied
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * When loading in an i_version value from a backing store, we typically don't
- * know whether it was previously viewed before being stored or not. Thus, we
- * must assume that it was, to ensure that any changes will result in the
- * value changing.
+ * Set @inode's i_version field to @val, and flag it for increment on the next
+ * change.
  *
- * This function will set the inode's i_version, and possibly flag the value
- * as if it has already been viewed at least once.
+ * Filesystems that persistently store the i_version on disk should use this
+ * when loading an existing inode from disk.
  *
- * For now, this just does what inode_set_iversion does.
+ * When loading in an i_version value from a backing store, we can't be certain
+ * that it wasn't previously viewed before being stored. Thus, we must assume
+ * that it was, to ensure that we don't end up handing out the same value for
+ * different versions of the same inode.
  */
 static inline void
-inode_set_iversion_queried(struct inode *inode, const u64 new)
+inode_set_iversion_queried(struct inode *inode, const u64 val)
 {
-	inode_set_iversion(inode, new);
+	inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
+				I_VERSION_QUERIED);
 }
 
 /**
  * inode_maybe_inc_iversion - increments i_version
  * @inode: inode with the i_version that should be updated
- * @force: increment the counter even if it's not necessary
+ * @force: increment the counter even if it's not necessary?
  *
  * Every time the inode is modified, the i_version field must be seen to have
  * changed by any observer.
  *
- * In this implementation, we always increment it after taking the i_lock to
- * ensure that we don't race with other incrementors.
+ * If "force" is set or the QUERIED flag is set, then ensure that we increment
+ * the value, and clear the queried flag.
+ *
+ * In the common case where neither is set, then we can return "false" without
+ * updating i_version.
  *
- * Returns true if counter was bumped, and false if it wasn't.
+ * If this function returns false, and no other metadata has changed, then we
+ * can avoid logging the metadata.
  */
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+	u64 cur, old, new;
 
-	atomic64_inc(ivp);
+	cur = (u64)atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is clear then we needn't do anything */
+		if (!force && !(cur & I_VERSION_QUERIED))
+			return false;
+
+		/* Since lowest bit is flag, add 2 to avoid it */
+		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
+
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (likely(old == cur))
+			break;
+		cur = old;
+	}
 	return true;
 }
 
@@ -124,21 +183,6 @@ inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_iversion_need_inc - is the i_version in need of being incremented?
- * @inode: inode to check
- *
- * Returns whether the inode->i_version counter needs incrementing on the next
- * change.
- *
- * For now, we assume that it always does.
- */
-static inline bool
-inode_iversion_need_inc(struct inode *inode)
-{
-	return true;
-}
-
 /**
  * inode_peek_iversion_raw - grab a "raw" iversion value
  * @inode: inode from which i_version should be read
@@ -153,7 +197,20 @@ inode_iversion_need_inc(struct inode *inode)
 static inline u64
 inode_peek_iversion_raw(const struct inode *inode)
 {
-	return inode->i_version;
+	return atomic64_read(&inode->i_version);
+}
+
+/**
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
+ *
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change. Just fetch the value and check the QUERIED flag.
+ */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
+{
+	return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
 }
 
 /**
@@ -170,7 +227,7 @@ inode_peek_iversion_raw(const struct inode *inode)
 static inline u64
 inode_peek_iversion(const struct inode *inode)
 {
-	return inode_peek_iversion_raw(inode);
+	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -182,12 +239,28 @@ inode_peek_iversion(const struct inode *inode)
  * that a later query of the i_version will result in a different value if
  * anything has changed.
  *
- * This implementation just does a peek.
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
  */
 static inline u64
 inode_query_iversion(struct inode *inode)
 {
-	return inode_peek_iversion(inode);
+	u64 cur, old, new;
+
+	cur = atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED)
+			break;
+
+		new = cur | I_VERSION_QUERIED;
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -196,11 +269,18 @@ inode_query_iversion(struct inode *inode)
  * @old: old value to check against its i_version
  *
  * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same or non-zero if they are different.
+ * the same, a positive value if the one in the inode appears newer than @old,
+ * and a negative value if @old appears to be newer than the one in the
+ * inode.
+ *
+ * Note that we don't need to set the QUERIED flag in this case, as the value
+ * in the inode is not being recorded for later use.
  */
+
 static inline s64
 inode_cmp_iversion(const struct inode *inode, const u64 old)
 {
-	return (s64)inode_peek_iversion(inode) - (s64)old;
+	return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
+	       (s64)(old << I_VERSION_QUERIED_SHIFT);
 }
 #endif