Message ID | 20240613001215.648829-2-mjguzik@gmail.com
---|---
State | New
Series | stop lockref from degrading to locked-only ops
On Wed, 12 Jun 2024 at 17:12, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> While I did not try to figure out who transiently took the lock (it was
> something outside of the benchmark), I devised a trivial reproducer
> which triggers the problem almost every time: merely issue "ls" of the
> directory containing the tested file (in this case: "ls /tmp").

So I have no problem with your patch 2/2 - moving the lockref data
structure away from everything else that can be shared read-only makes a
ton of sense independently of anything else.

Except you also randomly increased a retry count in there, which makes
no sense.

But this patch 1/2 makes me go "Eww, hacky hacky".

We already *have* the retry loop, it's just that currently it only
covers the cmpxchg failures.

The natural thing to do is to just make the "wait for unlocked" be part
of the same loop.

In fact, I have this memory of trying this originally, and it not
mattering and just making the code uglier, but that may be me confusing
myself. It's a *loong* time ago.

With the attached patch, lockref_get() (to pick one random case) ends up
looking like this:

        mov    (%rdi),%rax
        mov    $0x64,%ecx
    loop:
        test   %eax,%eax
        jne    locked
        mov    %rax,%rdx
        sar    $0x20,%rdx
        add    $0x1,%edx
        shl    $0x20,%rdx
        lock cmpxchg %rdx,(%rdi)
        jne    fail
        // SUCCESS
        ret
    locked:
        pause
        mov    (%rdi),%rax
    fail:
        sub    $0x1,%ecx
        jne    loop

(with the rest being the "take lock and go slow" case).

It seems much better to me to have *one* retry loop that handles both
causes of failure.

Entirely untested, I only looked at the generated code and it looked
reasonable. The patch may be entirely broken for some random reason I
didn't think of.

And in case you wonder, that 'lockref_locked()' macro I introduce is
purely to make the code more readable. Without it, that one conditional
line ends up being insanely long; the macro is there just to break
things up into slightly more manageable chunks.

Mind testing this approach instead?

              Linus
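[Editor's note: for readers following along outside the kernel tree, the single-retry-loop idea described above can be sketched in plain user-space C11 atomics. This is only an illustration under assumptions: the toy_* names, the 100-iteration budget and the 64-bit packing (low 32 bits as the lock word, high 32 bits as the count) are invented here to mirror the generated code shown above; they are not the kernel's lockref API, which presumably folds this into its existing CMPXCHG_LOOP machinery together with the lockref_locked() helper mentioned above.]

/*
 * Stand-alone sketch of one retry loop covering both "lock is held" and
 * "cmpxchg lost a race", as in the asm listing above. Toy code, not kernel code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOCKREF_RETRIES 100	/* arbitrary, like the 0x64 in the asm */

struct toy_lockref {
	/* low 32 bits: lock word (0 == unlocked), high 32 bits: refcount */
	_Atomic uint64_t lock_count;
};

static inline bool toy_locked(uint64_t v)
{
	return (uint32_t)v != 0;
}

/* Returns true on lockless success, false if the caller must take the lock. */
static bool toy_lockref_get(struct toy_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);

	for (int retry = LOCKREF_RETRIES; retry; retry--) {
		if (!toy_locked(old)) {
			uint64_t new = old + ((uint64_t)1 << 32);	/* count++ */

			/* On failure, compare_exchange refreshes 'old' for us. */
			if (atomic_compare_exchange_weak(&ref->lock_count, &old, new))
				return true;
		} else {
			/* Lock held: re-read and burn one retry; a real
			 * implementation would also pause/cpu_relax() here. */
			old = atomic_load_explicit(&ref->lock_count,
						   memory_order_relaxed);
		}
	}
	return false;	/* fall back to the locked slow path */
}

int main(void)
{
	struct toy_lockref ref = { .lock_count = 0 };

	toy_lockref_get(&ref);
	printf("count = %u\n", (unsigned)(atomic_load(&ref.lock_count) >> 32));
	return 0;
}

[The point of the structure is that a transiently held lock and a lost cmpxchg race are handled by the same bounded loop, rather than by an immediate fall-back to locking.]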
On Wed, 12 Jun 2024 at 18:23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> The natural thing to do is to just make the "wait for unlocked" be
> part of the same loop.

Oh, and while I like my approach a lot more than your patch, I do think
that the real issue here is likely that something takes the d_lock way
too much.

One of the ideas behind the lockref was that locking should be an
exceptional thing when something special happens. So things like final
dput() and friends.

What I *think* is going on - judging by your description of how you
triggered this - is that sadly our libfs 'readdir()' thing is pretty
nasty.

It does use d_lock a lot for the cursor handling, and things like
scan_positives() in particular.

I understand *why* it does that, and maybe it's practically unfixable,
but I do think the most likely deeper reason for that "go into slow
mode" is the cursor on the directory causing issues.

Put another way: while I think doing the retry loop will help
benchmarks, it would be lovely if you were to look at that arguably
deeper issue of the 'd_sib' list.

              Linus
On Wed, Jun 12, 2024 at 06:23:18PM -0700, Linus Torvalds wrote:
> So I have no problem with your patch 2/2 - moving the lockref data
> structure away from everything else that can be shared read-only makes
> a ton of sense independently of anything else.
>
> Except you also randomly increased a retry count in there, which makes no sense.
>

C'mon man, that's a change which unintentionally crept into the patch
and I failed to notice.

> But this patch 1/2 makes me go "Eww, hacky hacky".
>
> We already *have* the retry loop, it's just that currently it only
> covers the cmpxchg failures.
>
> The natural thing to do is to just make the "wait for unlocked" be
> part of the same loop.
>

I was playing with a bpftrace probe reporting how many spins were
performed waiting for the lock. For my intentional usage with ls it was
always < 30 or so. The random-ass intruder which messes with my bench on
occasion needed over 100. The bump is something I must have fat-fingered
into the wrong patch.

Ultimately even if *some* iterations will still take the lock, they
should be able to avoid it next time around, so the permanent
degradation problem is still solved.

> Mind testing this approach instead?
>

So I'm not going to argue about the fix. I tested your code and confirm
it fixes the problem, nothing blows up either and fwiw I don't see any
bugs in it.

When writing the commit message feel free to use mine in whatever
capacity (including none) you want.

On Wed, Jun 12, 2024 at 06:49:00PM -0700, Linus Torvalds wrote:
> On Wed, 12 Jun 2024 at 18:23, Linus Torvalds
> One of the ideas behind the lockref was that locking should be an
> exceptional thing when something special happens. So things like final
> dput() and friends.
>
> What I *think* is going on - judging by your description of how you
> triggered this - is that sadly our libfs 'readdir()' thing is pretty
> nasty.
>
> It does use d_lock a lot for the cursor handling, and things like
> scan_positives() in particular.
>
> I understand *why* it does that, and maybe it's practically unfixable,
> but I do think the most likely deeper reason for that "go into slow
> mode" is the cursor on the directory causing issues.
>
> Put another way: while I think doing the retry loop will help
> benchmarks, it would be lovely if you were to look at that arguably
> deeper issue of the 'd_sib' list.
>

I think lockref claiming to be a general locking facility means it
should not be suffering the degradation problem to begin with, so that
would be the real problem as far as lockref goes.

For vfs as a whole I do agree that the d_lock usage in various spots is
probably avoidable and if so would be nice to get patched out, readdir
included. So Someone(tm) should certainly look into it. As for me, at
the moment I have other stuff on my todo list, so I am not going to do
it for the time being.

Regardless, patching up lockref to dodge the lock is a step in the right
direction AFAICS.

=====

All that aside, you did not indicate how you want to move forward
regarding patch submission.

As indicated in my cover letter the vfs change (if it is to land) needs
to be placed *after* the lockref issue is addressed, otherwise it may
result in bogus regression reports. Thus I think it makes most sense to
just ship them together.

So maybe you can send me a patch and I send a v2 of the patchset with
that, or alternatively perhaps you can edit out the unintentional retry
adjustment from my dentry patch and handle the rest yourself. Or maybe
you have some other idea.
The thing that matters to me is to avoid landing in a state where
d_lockref is moved around but the lockref corner case is not patched
(even if it is patched *after*). I really don't want to investigate a
regression report only to find it's caused by the above.
> All that aside, you did not indicate how you want to move forward
> regarding patch submission.

I've picked up Linus' patch and yours for testing into the vfs.inode.rcu
branch. It was trivial to fix your typo and to add Linus as author with
your commit message. Let's see what syzbot and that perf bot have to say.
On Thu, Jun 13, 2024 at 3:46 PM Christian Brauner <brauner@kernel.org> wrote:
>
> > All that aside, you did not indicate how you want to move forward
> > regarding patch submission.
>
> I've picked up Linus' patch and yours for testing into the vfs.inode.rcu
> branch. It was trivial to fix your typo and to add Linus as author with
> your commit message. Let's see what syzbot and that perf bot have to say.

Sounds good to me, thanks.
On Wed, 12 Jun 2024 at 23:10, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Jun 12, 2024 at 06:23:18PM -0700, Linus Torvalds wrote:
> > So I have no problem with your patch 2/2 - moving the lockref data
> > structure away from everything else that can be shared read-only makes
> > a ton of sense independently of anything else.
> >
> > Except you also randomly increased a retry count in there, which makes no sense.
>
> C'mon man, that's a change which unintentionally crept into the patch
> and I failed to notice.

Heh. It wasn't entirely obvious that it was unintentional, since the
series very much was about that whole retry thing. But good.

> I was playing with a bpftrace probe reporting how many spins were
> performed waiting for the lock. For my intentional usage with ls it was
> always < 30 or so. The random-ass intruder which messes with my bench on
> occasion needed over 100.

Ok. So keeping it at 100 is likely fine.

Of course, one option is to simply get rid of the retry count entirely,
and just make it all be entirely unconditional. The fallback to locking
is not technically required at all for the USE_CMPXCHG_LOCKREF case.
That has some unfairness issues, though.

But my point is mostly that the count is probably not hugely important
or meaningful.

> I tested your code and confirm it fixes the problem, nothing blows up
> either and fwiw I don't see any bugs in it.

I was more worried about some fat-fingering than any huge conceptual
bugs, and any minor testing with performance checks would find that.

Just as an example: my first attempt switched the while(likely(..)) to
the if (unlikely(..)) in the loop, but didn't add the "!" to negate the
test.

I caught it immediately and obviously never sent that broken thing out
(and it was one reason why I decided I needed to make the code more
readable with that lockref_locked() helper macro). But that's mainly the
kind of thing I was worried about.

> When writing the commit message feel free to use mine in whatever
> capacity (including none) you want.

Ack.

> I think lockref claiming to be a general locking facility means it
> should not be suffering the degradation problem to begin with, so that
> would be the real problem as far as lockref goes.

Well, it was certainly originally meant to be generic, but after more
than a decade, the number of users is three. And the two non-dcache
users are pretty darn random.

So it's effectively really just a dcache special case with two
filesystems that started using it just because the authors had clearly
seen the vfs use of it...

> All that aside, you did not indicate how you want to move forward
> regarding patch submission.

lockref is fairly unusual, and is pretty much mine. The initial concept
was from Waiman Long:

    https://lore.kernel.org/all/1372268603-46748-1-git-send-email-Waiman.Long@hp.com/

but I ended up redoing it pretty much from scratch, and credited him as
author for the initial commit.

It's _technically_ a locking thing, but realistically it has never been
in the locking tree and it's effectively a dcache thing.

              Linus
On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
>
> I've picked up Linus' patch and yours for testing into the vfs.inode.rcu branch.

Btw, if you added [patch 2/2] too, I think the exact byte counts in the
comments are a bit misleading.

The actual cacheline details will very much depend on 32-bit vs 64-bit
builds, but also on things like the slab allocator debug settings.

I think the important part is to keep the d_lockref - that is often
dirty and exclusive in the cache - away from the mostly RO parts of the
dentry that can be shared across CPUs in the cache.

So rather than talk about exact byte offsets, maybe just state that
overall rule?

              Linus
On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
> >
> > I've picked up Linus' patch and yours for testing into the vfs.inode.rcu branch.
>
> Btw, if you added [patch 2/2] too, I think the exact byte counts in
> the comments are a bit misleading.
>
> The actual cacheline details will very much depend on 32-bit vs 64-bit
> builds, but also on things like the slab allocator debug settings.
>

I indeed assumed "x86_64 production", with lines just copied from pahole.

However, to the best of my understanding the counts are what one should
expect on a 64-bit kernel.

That said:

> I think the important part is to keep the d_lockref - that is often
> dirty and exclusive in the cache - away from the mostly RO parts of
> the dentry that can be shared across CPUs in the cache.
>
> So rather than talk about exact byte offsets, maybe just state that
> overall rule?
>

I would assume the rule is pretty much well known and instead an
indicator of where is what (as added in my comments) would be welcome.

But if going that route, then perhaps:

"Make sure d_lockref does not share a cacheline with fields used by RCU
lookup, otherwise it can result in a significant throughput reduction.
You can use pahole(1) to check the layout."

[maybe a link to perfbook or something goes here?]

Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap
(only active without whatever runtime debug messing with the layout).
On Thu, Jun 13, 2024 at 8:13 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Jun 13, 2024 at 7:00 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 13 Jun 2024 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > I've picked up Linus' patch and yours for testing into the vfs.inode.rcu branch.
> >
> > Btw, if you added [patch 2/2] too, I think the exact byte counts in
> > the comments are a bit misleading.
> >
> > The actual cacheline details will very much depend on 32-bit vs 64-bit
> > builds, but also on things like the slab allocator debug settings.
> >
>
> I indeed assumed "x86_64 production", with lines just copied from pahole.
>
> However, to the best of my understanding the counts are what one should
> expect on a 64-bit kernel.
>
> That said:
>
> > I think the important part is to keep the d_lockref - that is often
> > dirty and exclusive in the cache - away from the mostly RO parts of
> > the dentry that can be shared across CPUs in the cache.
> >
> > So rather than talk about exact byte offsets, maybe just state that
> > overall rule?
> >
>
> I would assume the rule is pretty much well known and instead an
> indicator of where is what (as added in my comments) would be welcome.
>
> But if going that route, then perhaps:
>
> "Make sure d_lockref does not share a cacheline with fields used by RCU
> lookup, otherwise it can result in a significant throughput reduction.
> You can use pahole(1) to check the layout."
>
> [maybe a link to perfbook or something goes here?]
>
> Arguably a bunch of BUILD_BUG_ONs could be added to detect the overlap
> (only active without whatever runtime debug messing with the layout).
>

So I tried to add the BUILD_BUG_ONs but I got some compilation errors
immediately concerning the syntax, maybe I'm brainfarting here. I am not
pasting that in case you want to take a stab at it yourself from scratch.

I did however type up the following:

/*
 * Note: dentries have a read-mostly area heavily used by RCU (denoted with
 * "RCU lookup touched fields") which must not share cachelines with a
 * frequently written-to field like d_lockref.
 *
 * If you are modifying the layout you can check with pahole(1) that there is
 * no overlap.
 */

could land just above the struct.

That's my $0.03. I am not going to give further commentary on the
matter, you guys touch it up however you see fit.
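[Editor's note: a compile-time layout check of the sort being discussed could look roughly like the sketch below. It uses a toy struct with invented names rather than the real struct dentry, and C11 static_assert rather than the kernel's BUILD_BUG_ON(); a kernel version would assert on the actual dentry fields and would have to be compiled out under debug options that change the layout, exactly as the message above notes.]

/* Toy demonstration of asserting that a hot, written field does not share
 * a 64-byte cacheline with the read-mostly lookup fields. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

struct toy_dentry {
	/* read-mostly fields used by RCU lookup */
	unsigned int d_flags;
	uint32_t d_seq;
	void *d_parent;
	void *d_inode;
	char d_iname[40];
	/* frequently written reference count, pushed onto its own line */
	_Alignas(CACHELINE) uint64_t d_lockref;
	void *d_fsdata;
};

/* The RCU-read area must end no later than the start of the cacheline
 * that holds d_lockref. */
static_assert(offsetof(struct toy_dentry, d_iname) +
	      sizeof(((struct toy_dentry *)0)->d_iname) <=
	      (offsetof(struct toy_dentry, d_lockref) / CACHELINE) * CACHELINE,
	      "d_lockref shares a cacheline with RCU lookup fields");

int main(void) { return 0; }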
On Thu, 13 Jun 2024 at 11:13, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I would assume the rule is pretty much well known and instead an
> indicator of where is what (as added in my comments) would be welcome.

Oh, the rule is well-known, but I think what is worth pointing out is
the different classes of fields, and the name[] field in particular.

This ordering was last really updated back in 2011, by commit
44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk"). And
it was actually somewhat intentional at the time. Quoting from that
commit:

    We also fit in 8 bytes of inline name in the first 64 bytes, so for short
    names, only 64 bytes needs to be touched to perform the lookup. We should
    get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
    of name in there, which will take care of 81% rather than 32% of the kernel
    tree.

but what has actually really changed - and that I didn't even realize
until I now did a 'pahole' on it - was that this was all COMPLETELY
broken by

    seqcount_spinlock_t d_seq;

because seqcount_spinlock_t has been entirely broken and went from being
4 bytes back when, to now being 64 bytes.

Crazy crazy.

How did that happen?

              Linus
On Thu, Jun 13, 2024 at 8:43 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 13 Jun 2024 at 11:13, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I would assume the rule is pretty much well known and instead an
> > indicator of where is what (as added in my comments) would be welcome.
>
> Oh, the rule is well-known, but I think what is worth pointing out is
> the different classes of fields, and the name[] field in particular.
>
> This ordering was last really updated back in 2011, by commit
> 44a7d7a878c9 ("fs: cache optimise dentry and inode for rcu-walk"). And
> it was actually somewhat intentional at the time. Quoting from that
> commit:
>
>     We also fit in 8 bytes of inline name in the first 64 bytes, so for short
>     names, only 64 bytes needs to be touched to perform the lookup. We should
>     get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
>     of name in there, which will take care of 81% rather than 32% of the kernel
>     tree.
>

Right. Things degrading by accident is why I suggested BUILD_BUG_ON.

> but what has actually really changed - and that I didn't even realize
> until I now did a 'pahole' on it - was that this was all COMPLETELY
> broken by
>
>     seqcount_spinlock_t d_seq;
>
> because seqcount_spinlock_t has been entirely broken and went from
> being 4 bytes back when, to now being 64 bytes.
>
> Crazy crazy.
>
> How did that happen?
>

Perhaps lockdep in your config?

This is the layout on my prod config:

struct dentry {
	unsigned int d_flags;                            /*     0     4 */
	seqcount_spinlock_t d_seq;                       /*     4     4 */
	struct hlist_bl_node d_hash;                     /*     8    16 */
	struct dentry * d_parent;                        /*    24     8 */
	struct qstr d_name;                              /*    32    16 */
	struct inode * d_inode;                          /*    48     8 */
	unsigned char d_iname[40];                       /*    56    40 */
	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
	const struct dentry_operations * d_op;           /*    96     8 */
	struct super_block * d_sb;                       /*   104     8 */
	long unsigned int d_time;                        /*   112     8 */
	void * d_fsdata;                                 /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct lockref d_lockref __attribute__((__aligned__(8))); /*   128     8 */
	union {
		struct list_head d_lru;                  /*   136    16 */
		wait_queue_head_t * d_wait;              /*   136     8 */
	};                                               /*   136    16 */
	struct hlist_node d_sib;                         /*   152    16 */
	struct hlist_head d_children;                    /*   168     8 */
	union {
		struct hlist_node d_alias;               /*   176    16 */
		struct hlist_bl_node d_in_lookup_hash;   /*   176    16 */
		struct callback_head d_rcu __attribute__((__aligned__(8))); /*   176    16 */
	} d_u __attribute__((__aligned__(8)));           /*   176    16 */

	/* size: 192, cachelines: 3, members: 16 */
	/* forced alignments: 2 */
} __attribute__((__aligned__(8)));
On Thu, Jun 13, 2024 at 11:43:05AM -0700, Linus Torvalds wrote:
>     seqcount_spinlock_t d_seq;
>
> because seqcount_spinlock_t has been entirely broken and went from
> being 4 bytes back when, to now being 64 bytes.

1ca7d67cf5d5 "seqcount: Add lockdep functionality to seqcount/seqlock structures"
On Thu, 13 Jun 2024 at 11:48, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Perhaps lockdep in your config?

Yes, happily it was just lockdep, and the fact that my regular tree
doesn't have debug info, so I looked at my allmodconfig tree.

I didn't *think* anything in the dentry struct should care about
debugging, but clearly that sequence number thing did.

But with that fixed, it's still the case that "we used to know about
this", but what you actually fixed is the case of larger names than 8
bytes.

You did mention the name clashing in your commit message, but I think
that should be the important part in the code comments too: make them
about "these fields are hot and pretty much read-only", "these fields
don't matter" and "this field is hot and written, and shouldn't be near
the read-only ones".

The exact byte counts may change, the basic notion doesn't.

(Of course, putting it at the *end* of the structure then possibly
causes cache conflicts with the next one - we don't force the dentries
to be cacheline aligned even if we've tried to make them generally work
that way.)

              Linus
On Thu, Jun 13, 2024 at 8:56 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 13 Jun 2024 at 11:48, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > Perhaps lockdep in your config?
>
> Yes, happily it was just lockdep, and the fact that my regular tree
> doesn't have debug info, so I looked at my allmodconfig tree.
>
> I didn't *think* anything in the dentry struct should care about
> debugging, but clearly that sequence number thing did.
>
> But with that fixed, it's still the case that "we used to know about
> this", but what you actually fixed is the case of larger names than 8
> bytes.
>
> You did mention the name clashing in your commit message, but I think
> that should be the important part in the code comments too: make them
> about "these fields are hot and pretty much read-only", "these fields
> don't matter" and "this field is hot and written, and shouldn't be
> near the read-only ones".
>
> The exact byte counts may change, the basic notion doesn't.
>
> (Of course, putting it at the *end* of the structure then possibly
> causes cache conflicts with the next one - we don't force the dentries
> to be cacheline aligned even if we've tried to make them generally work
> that way.)
>

Look mate, I'm not going to go back and forth about this bit.

Nobody is getting a Turing award for moving a field elsewhere, so as far
as I am concerned you are free to write your own version of the patch, I
don't even need to be mentioned. You are also free to grab the commit
message in whatever capacity (I guess you would at least keep the bench
results?).

As long as lockref (the facility) and d_lockref are out of the way I'm a
happy camper.
On Thu, 13 Jun 2024 at 11:56, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I didn't *think* anything in the dentry struct should care about
> debugging, but clearly that sequence number thing did.

Looking at the 32-bit build, it looks like our current 'struct dentry'
is 136 bytes in size, not 128.

Looks like DNAME_INLINE_LEN should be reduced to 36 on 32-bit.

And moving d_lockref to after d_fsdata works there too.

Not that anybody really cares, but let's make sure it's actually
properly done when this is changed. Christian?

              Linus
On Thu, Jun 13, 2024 at 12:33:59PM GMT, Linus Torvalds wrote:
> On Thu, 13 Jun 2024 at 11:56, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I didn't *think* anything in the dentry struct should care about
> > debugging, but clearly that sequence number thing did.
>
> Looking at the 32-bit build, it looks like our current 'struct dentry'
> is 136 bytes in size, not 128.
>
> Looks like DNAME_INLINE_LEN should be reduced to 36 on 32-bit.
>
> And moving d_lockref to after d_fsdata works there too.
>
> Not that anybody really cares, but let's make sure it's actually
> properly done when this is changed. Christian?

So I verified that indeed on i386 with CONFIG_SMP we need 36 to get to
128 bytes and yes, moving d_lockref after d_fsdata works. So I've put
that patch on top.

(Sorry, took a bit. I'm back tomorrow.)
diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..596b521bc1f1 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -26,10 +26,46 @@
 	} \
 } while (0)
 
+/*
+ * Most routines only ever inc/dec the reference, but the lock may be
+ * transiently held forcing them to take it as well.
+ *
+ * Should the lock be taken for any reason (including outside of lockref),
+ * a steady stream of ref/unref requests may find itself unable to go back
+ * to lockless operation.
+ *
+ * Combat the problem by giving the routines a way to speculatively wait in
+ * hopes of avoiding having to take the lock.
+ *
+ * The spin count is limited to guarantee forward progress, although the
+ * value is arbitrarily chosen.
+ *
+ * Note this routine is only used if the lock was found to be taken.
+ */
+static inline bool lockref_trywait_unlocked(struct lockref *lockref)
+{
+	struct lockref old;
+	int retry = 100;
+
+	for (;;) {
+		cpu_relax();
+		old.lock_count = READ_ONCE(lockref->lock_count);
+		if (arch_spin_value_unlocked(old.lock.rlock.raw_lock))
+			return true;
+		if (!--retry)
+			return false;
+	}
+}
+
 #else
 
 #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
 
+static inline bool lockref_trywait_unlocked(struct lockref *lockref)
+{
+	return false;
+}
+
 #endif
 
 /**
@@ -47,6 +83,14 @@ void lockref_get(struct lockref *lockref)
 		return;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+		,
+			return;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	lockref->count++;
 	spin_unlock(&lockref->lock);
@@ -70,6 +114,16 @@ int lockref_get_not_zero(struct lockref *lockref)
 		return 1;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+			if (old.count <= 0)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count > 0) {
@@ -98,6 +152,16 @@ int lockref_put_not_zero(struct lockref *lockref)
 		return 1;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count--;
+			if (old.count <= 1)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count > 1) {
@@ -125,6 +189,17 @@ int lockref_put_return(struct lockref *lockref)
 	,
 		return new.count;
 	);
+
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count--;
+			if (old.count <= 0)
+				return -1;
+		,
+			return new.count;
+		);
+	}
+
 	return -1;
 }
 EXPORT_SYMBOL(lockref_put_return);
@@ -181,6 +256,16 @@ int lockref_get_not_dead(struct lockref *lockref)
 		return 1;
 	);
 
+	if (lockref_trywait_unlocked(lockref)) {
+		CMPXCHG_LOOP(
+			new.count++;
+			if (old.count < 0)
+				return 0;
+		,
+			return 1;
+		);
+	}
+
 	spin_lock(&lockref->lock);
 	retval = 0;
 	if (lockref->count >= 0) {
The usual inc/dec scheme is to try to do it with atomics if the lock is
not taken and fall back to locked operation otherwise. If the lock is
only transiently held, routines could instead speculatively wait and
avoid the acquire altogether.

This has a minor benefit of slightly speeding up cases of the sort,
which do happen in real workloads.

More importantly this fixes a corner case where lockref consumers
degrade to locked operation and are unable to go back to atomics.

Consider a continuous stream of lockref ops from several threads and
another thread taking the lock for a brief moment. Any lockref user
which manages to observe the lock as taken is going to fall back to
locking it itself, but this increases the likelihood that more threads
executing the code will do the same. This eventually can culminate in
nobody being able to go back to atomics on the problematic lockref.

While this is not a state which can persist in a real workload for
anything but a few calls, it very much can permanently degrade when
benchmarking and in consequence grossly disfigure results.

Here is an example of will-it-scale doing 20-way stat(2) on the same
file residing on tmpfs, reporting once per second the number of ops
completed:

[snip several ok results]
min:417297 max:426249 total:8401766    <-- expected performance
min:219024 max:221764 total:4398413    <-- some locking started
min:62855 max:64949 total:1273712      <-- everyone degraded
min:62472 max:64873 total:1268733
min:62179 max:63843 total:1256375
[snip remaining bad results]

While I did not try to figure out who transiently took the lock (it was
something outside of the benchmark), I devised a trivial reproducer
which triggers the problem almost every time: merely issue "ls" of the
directory containing the tested file (in this case: "ls /tmp").

The problem does not persist with the fix below.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 lib/lockref.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
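[Editor's note: the control flow the patch above adds (a bounded wait for the lock to clear, then one more lockless attempt, and only then the locked fallback) can be modelled in user-space C11 atomics as below. This is a sketch under assumptions: the toy_* names, the packed 64-bit lock word and the retry limits are inventions of this illustration, not the kernel implementation, which keeps a real spinlock inside struct lockref.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_lockref { _Atomic uint64_t lock_count; };

#define TOY_LOCKED   1ULL		/* low bits stand in for the lock word */
#define TOY_ONE_REF  (1ULL << 32)	/* high 32 bits hold the count */

/* Lockless increment; gives up if the lock is held or after too many races. */
static bool toy_lockless_inc(struct toy_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);

	for (int retry = 100; retry; retry--) {
		if (old & TOY_LOCKED)
			return false;
		if (atomic_compare_exchange_weak(&ref->lock_count, &old,
						 old + TOY_ONE_REF))
			return true;
	}
	return false;
}

/* Bounded wait for the lock to clear, akin to lockref_trywait_unlocked(). */
static bool toy_trywait_unlocked(struct toy_lockref *ref)
{
	for (int retry = 100; retry; retry--) {
		if (!(atomic_load_explicit(&ref->lock_count,
					   memory_order_relaxed) & TOY_LOCKED))
			return true;
	}
	return false;
}

/* Crude lock acquisition on the embedded lock word (slow path only). */
static void toy_lock(struct toy_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);

	do {
		old &= ~TOY_LOCKED;	/* only attempt when it looks unlocked */
	} while (!atomic_compare_exchange_weak(&ref->lock_count, &old,
					       old | TOY_LOCKED));
}

void toy_get(struct toy_lockref *ref)
{
	if (toy_lockless_inc(ref))
		return;
	/* The lock was seen held: wait it out, then retry lockless once. */
	if (toy_trywait_unlocked(ref) && toy_lockless_inc(ref))
		return;
	toy_lock(ref);					/* degraded, locked path */
	atomic_fetch_add(&ref->lock_count, TOY_ONE_REF);
	atomic_fetch_and(&ref->lock_count, ~TOY_LOCKED);	/* unlock */
}

int main(void)
{
	struct toy_lockref ref = { .lock_count = 0 };

	toy_get(&ref);
	return (int)(atomic_load(&ref.lock_count) >> 32);	/* 1 */
}

[The contrast with the earlier single-loop sketch is the point of the thread: the patch above waits and then retries the whole lockless path, while Linus' version folds the wait into the one existing retry loop.]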