Message ID | 20231110042041.GL1957730@ZenIV (mailing list archive)
---|---
State | New, archived
Series | lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())
On Thu, 9 Nov 2023 at 20:20, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, on top of current #work.dcache2 the following delta might be worth
> looking into. Not sure if it's less confusing that way, though - I'd been staring
> at that place for too long. Code generation is slightly suboptimal with recent
> gcc, but only marginally so.

I doubt the pure ALU ops and a couple of extra conditional branches (that _probably_ predict well) matter at all.

Especially since this is all after lockref_put_return() has done that locked cmpxchg, which *is* expensive.

My main reaction is that we use hlist_bl_unhashed() for d_unhashed(), and we *intentionally* make it separate from the actual unhashing:

 - ___d_drop() does the __hlist_bl_del()

 - but d_unhashed() does hlist_bl_unhashed(), which checks d_hash.pprev == NULL, and that's done by __d_drop

We even have a comment about this:

     * ___d_drop doesn't mark dentry as "unhashed"
     * (dentry->d_hash.pprev will be LIST_POISON2, not NULL).

and we depend on this in __d_move(), which will unhash things temporarily, but not mark things unhashed, because they get re-hashed again. Same goes for __d_add().

Anyway, what I'm actually getting at in a roundabout way is that maybe we should make D_UNHASHED be another flag in d_flags, and *not* use that d_hash.pprev field, and that would allow us to combine even more of these tests in dput(), because now pretty much *all* of those "retain_dentry()" checks would be about d_flags bits.

Hmm? As it is, it has that odd combination of d_flags and that d_unhashed() test, so it's testing two different fields.

Anyway, I really don't think it matters much, but since you brought up the whole suboptimal code generation..

I tried to look at dput() code generation, and it doesn't look horrendous as-is in your dcache2 branch.

If anything, the thing that hurts is the lockref_put_return() being out-of-line even though this is basically the only caller, plus people have pessimized the arch_spin_value_unlocked() implementation *again*, so that it uses a volatile read, when the *WHOLE*POINT* of that "VALUE" part of "value_unlocked()" is that we've already read the value, and we should *not* re-read it.

Damn.

The bug seems to affect both the generic qspinlock code, and the ticket-based one.

For the ticket based ones, it's PeterZ and commit 1bce11126d57 ("asm-generic: ticket-lock: New generic ticket-based spinlock"), which does

    static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
    {
            return !arch_spin_is_locked(&lock);
    }

where we've got that "lock" value, but then it takes the address of it, and uses arch_spin_is_locked() on it, so now it will force a flush to memory, and then a READ_ONCE() on it.

And for the qspinlock code, we had a similar

    static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
    {
            return !atomic_read(&lock.val);
    }

thing, where it does 'atomic_read()' on the value it was passed as an argument.

Stupid, stupid. It's literally forcing a re-read of a value that is guaranteed to be on stack.

I know this worked at some point, but that may have been many years ago, since I haven't looked at this part of lockref code generation in ages.

Anyway, as a result now all the lockref functions will do silly "store the old lockref value to memory, in order to read it again" dances in that CMPXCHG_LOOP() loop.

It literally makes that whole "is this an unlocked value" function completely pointless. The *whole* and only point was "look, I already loaded the value from memory, is this *VALUE* unlocked?"
Compared to that complete braindamage in the fast-path loop, the small extra ALU ops in fast_dput() are nothing.

Peter - those functions are done exactly the wrong way around. arch_spin_is_locked() should be implemented using arch_spin_value_unlocked(), not this way around.

And the queued spinlocks should not do an atomic_read() of the argument they get, they should just do "!lock.val.counter"

So something like this should fix lockref. ENTIRELY UNTESTED, except now the code generation of lockref_put_return() looks much better, without a pointless flush to the stack, and now it has no pointless stack frame as a result.

Of course, it should probably be inlined, since it has only one user (ok, two, since fast_dput() gets used twice), and that should make the return value testing much better.

                Linus
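For reference, the inversion described here amounts to roughly the following sketch for the generic helpers (it mirrors the patch quoted later in the thread against include/asm-generic/spinlock.h and qspinlock.h; shown here only as an illustration of the idea, not as the committed change):

    /* the "value" variant is the primitive: pure ALU work on an
     * already-loaded copy, no re-read from memory */
    static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
    {
            u32 val = lock.counter;

            /* a ticket lock is free when owner == next */
            return (val >> 16) == (val & 0xffff);
    }

    /* ...and the pointer variant does exactly one explicit read,
     * then reuses the value helper */
    static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
    {
            arch_spinlock_t val;

            val.counter = atomic_read(lock);
            return !arch_spin_value_unlocked(val);
    }

    /* qspinlock counterpart: operate on the passed-in value directly */
    static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
    {
            return !lock.val.counter;
    }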
On Thu, 9 Nov 2023 at 21:57, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> So something like this should fix lockref. ENTIRELY UNTESTED, except
> now the code generation of lockref_put_return() looks much better,
> without a pointless flush to the stack, and now it has no pointless
> stack frame as a result.

Heh. And because I was looking at Al's tree, I didn't notice that commit c6f4a9002252 ("asm-generic: ticket-lock: Optimize arch_spin_value_unlocked()") had solved the ticket spinlock part of this in this merge window in the meantime.

The qspinlock implementation - which is what x86 uses - is still broken in mainline, though.

So that part of my patch still stands. Now attached just the small one-liner part. Adding Ingo and Guo Ren, who did the ticket lock part (and looks to have done it very similarly to my suggested patch).

Ingo?

                Linus
On Thu, Nov 09, 2023 at 09:57:39PM -0800, Linus Torvalds wrote:

> Anyway, what I'm actually getting at in a roundabout way is that maybe
> we should make D_UNHASHED be another flag in d_flags, and *not* use
> that d_hash.pprev field, and that would allow us to combine even more
> of these tests in dput(), because now pretty much *all* of those
> "retain_dentry()" checks would be about d_flags bits.
>
> Hmm? As it is, it has that odd combination of d_flags and that
> d_unhashed() test, so it's testing two different fields.

Hmm, indeed. The trouble is, we are getting tight on the ->d_flags bits. Only two unassigned bits left (0x08000000 and 0x80000000).

DCACHE_COOKIE is defined (0x00002000), but unused. Should've been taken out when the dcookie stuff went.

DCACHE_DENTRY_KILLED might be mergeable with DCACHE_MAY_FREE now; worth looking into. In effect, DCACHE_MAY_FREE is set iff we have both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST - and the only place that checks it is guaranteed to have had DCACHE_SHRINK_LIST. Actually, that's nice - in terms of dentry states we have

        refcount > 0                                    <=> Busy
        refcount == 0                                   <=> Retained
        refcount < 0 && !KILLED                         <=> Dying
        refcount < 0 && KILLED && !SHRINK_LIST          <=> Freeing
        refcount < 0 && KILLED && SHRINK_LIST           <=> Husk

<makes a note in the docs being written>

DCACHE_FALLTHRU is odd - it's never checked (or set, for that matter); might be killable, might be intended for some overlayfs plans.

DCACHE_GENOCIDE might become killable, what with the selinuxfs patch I've got (apparently OK with selinux folks, will sort it out after -rc1).

OK, it's not as awful as I thought - one more bit won't hurt. I'll go through the unlocked callers and see if any of those is sensitive to separating setting that flag from hash list removal. There might be dragons...

> Anyway, I really don't think it matters much, but since you brought up
> the whole suboptimal code generation..

FWIW, it's not all that suboptimal, at least with current gcc. The thing I'm really not sure about is whether that patch makes the whole thing easier to follow - probably need to let it sit around for a week or so, then look at it again; right now I don't trust my taste regarding that particular change, having spent too much time today mucking with it ;-/
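As a quick illustration of that classification, the table above could be expressed as a helper like the sketch below. The flag and field names are the real ones from include/linux/dcache.h, but the function itself is hypothetical and ignores locking; it is only meant to make the refcount/flags combinations concrete:

    #include <linux/dcache.h>

    /* Hypothetical, illustration only - not a proposed kernel helper. */
    static const char *dentry_state(const struct dentry *d)
    {
            int count = d->d_lockref.count;
            unsigned int flags = d->d_flags;

            if (count > 0)
                    return "busy";
            if (count == 0)
                    return "retained";
            if (!(flags & DCACHE_DENTRY_KILLED))
                    return "dying";
            return (flags & DCACHE_SHRINK_LIST) ? "husk" : "freeing";
    }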
On Thu, Nov 09, 2023 at 10:22:13PM -0800, Linus Torvalds wrote: > On Thu, 9 Nov 2023 at 21:57, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So something like this should fix lockref. ENTIRELY UNTESTED, except > > now the code generation of lockref_put_return() looks much better, > > without a pointless flush to the stack, and now it has no pointless > > stack frame as a result. > > Heh. And because I was looking at Al's tree, I didn't notice that > commit c6f4a9002252 ("asm-generic: ticket-lock: Optimize > arch_spin_value_unlocked()") had solved the ticket spinlock part of > this in this merge window in the meantime. > > The qspinlock implementation - which is what x86 uses - is still > broken in mainline, though. > > So that part of my patch still stands. Now attached just the small > one-liner part. Adding Ingo and Guo Ren, who did the ticket lock part > (and looks to have done it very similarly to my suggested patch. Not only generic ticket lock, I think Will Deacon recognized the lockref problem of the arm32 ticket-lock in 2013. After my patch merged, I think riscv could also select ARCH_USE_CMPXCHG_LOCKREF in its Kconfig. Ref: commit 0cbad9c9dfe0c38e8ec7385b39087c005a6dee3e Author: Will Deacon <will@kernel.org> Date: Wed Oct 9 17:19:22 2013 +0100 ARM: 7854/1: lockref: add support for lockless lockrefs using cmpxchg64 Our spinlocks are only 32-bit (2x16-bit tickets) and, on processors with 64-bit atomic instructions, cmpxchg64 makes use of the double-word exclusive accessors. This patch wires up the cmpxchg-based lockless lockref implementation for ARM. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 1ad6fb6c094d..fc184bcd7848 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ config ARM select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAVE_CUSTOM_GPIO_H + select ARCH_USE_CMPXCHG_LOCKREF select ARCH_WANT_IPC_PARSE_VERSION select BUILDTIME_EXTABLE_SORT if MMU select CLONE_BACKWARDS diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h index 4f2c28060c9a..ed6c22919e47 100644 --- a/arch/arm/include/asm/spinlock.h +++ b/arch/arm/include/asm/spinlock.h @@ -127,10 +127,14 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock) dsb_sev(); } +static inline int arch_spin_value_unlocked(arch_spinlock_t lock) +{ + return lock.tickets.owner == lock.tickets.next; +} + static inline int arch_spin_is_locked(arch_spinlock_t *lock) { - struct __raw_tickets tickets = ACCESS_ONCE(lock->tickets); - return tickets.owner != tickets.next; + return !arch_spin_value_unlocked(ACCESS_ONCE(*lock)); } static inline int arch_spin_is_contended(arch_spinlock_t *lock) > > Ingo? > > Linus > include/asm-generic/qspinlock.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h > index 995513fa2690..0655aa5b57b2 100644 > --- a/include/asm-generic/qspinlock.h > +++ b/include/asm-generic/qspinlock.h > @@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock) > */ > static __always_inline int queued_spin_value_unlocked(struct qspinlock lock) > { > - return !atomic_read(&lock.val); > + return !lock.val.counter; > } > > /**
On Thu, Nov 09, 2023 at 09:57:39PM -0800, Linus Torvalds wrote: > On Thu, 9 Nov 2023 at 20:20, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > FWIW, on top of current #work.dcache2 the following delta might be worth > > looking into. Not sure if it's less confusing that way, though - I'd been staring > > at that place for too long. Code generation is slightly suboptimal with recent > > gcc, but only marginally so. > > I doubt the pure ALU ops and a couple of extra conditional branches > (that _probably_ predict well) matter at all. > > Especially since this is all after lockref_put_return() has done that > locked cmpxchg, which *is* expensive. > > My main reaction is that we use hlist_bl_unhashed() for d_unhashed(), > and we *intentionally* make it separate from the actual unhasing: > > - ___d_drop() does the __hlist_bl_del() > > - but d_unhashed() does hlist_bl_unhashed(), which checks > d_hash.pprev == NULL, and that's done by __d_drop > > We even have a comment about this: > > * ___d_drop doesn't mark dentry as "unhashed" > * (dentry->d_hash.pprev will be LIST_POISON2, not NULL). > > and we depend on this in __d_move(), which will unhash things > temporarily, but not mark things unhashed, because they get re-hashed > again. Same goes for __d_add(). > > Anyway, what I'm actually getting at in a roundabout way is that maybe > we should make D_UNHASHED be another flag in d_flags, and *not* use > that d_hash.pprev field, and that would allow us to combine even more > of these tests in dput(), because now pretty much *all* of those > "retain_dentry()" checks would be about d_flags bits. > > Hmm? As it is, it has that odd combination of d_flags and that > d_unhashed() test, so it's testing two different fields. > > Anyway, I really don't think it matters much, but since you brought up > the whole suboptimal code generation.. > > I tried to look at dput() code generation, and it doesn't look > horrendous as-is in your dcache2 branch. > > If anything, the thing that hirs is the lockref_put_return() being > out-of-line even though this is basically the only caller, plus people > have pessimized the arch_spin_value_unlocked() implementation *again*, > so that it uses a volatile read, when the *WHOLE*POINT* of that > "VALUE" part of "value_unlocked()" is that we've already read the > value, and we should *not* re-read it. > > Damn. > > The bug seems to affect both the generic qspinlock code, and the > ticket-based one. > > For the ticket based ones, it's PeterZ and commit 1bce11126d57 > ("asm-generic: ticket-lock: New generic ticket-based spinlock"), which > does > > static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) > { > return !arch_spin_is_locked(&lock); > } > > where we've got that "lock" value, but then it takes the address of > it, and uses arch_spin_is_locked() on it, so now it will force a flush > to memory, and then an READ_ONCE() on it. > > And for the qspinlock code, we had a similar We discussed x86 qspinlock code generation. It looked not too bad as I thought because qspinlock_spin_value_unlocked is much cheaper than the ticket-lock. But the riscv ticket-lock code generation is terrible because of the shift left & right 16-bit. https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com > > static __always_inline int queued_spin_value_unlocked(struct qspinlock lock) > { > return !atomic_read(&lock.val); > } > > thing, where it does 'atomic_read()' on the value it was passed as an argument. > > Stupid, stupid. 
It's literally forcing a re-read of a value that is > guaranteed to be on stack. > > I know this worked at some point, but that may have been many years > ago, since I haven't looked at this part of lockref code generation in > ages. > > Anway, as a result now all the lockref functions will do silly "store > the old lockref value to memory, in order to read it again" dances in > that CMPXCHG_LOOP() loop. > > It literally makes that whole "is this an unlocked value" function > completely pointless. The *whole* and only point was "look, I already > loaded the value from memory, is this *VALUE* unlocked. > > Compared to that complete braindamage in the fast-path loop, the small > extra ALU ops in fast_dput() are nothing. > > Peter - those functions are done exactly the wrong way around. > arch_spin_is_locked() should be implemented using > arch_spin_value_unlocked(), not this way around. > > And the queued spinlocks should not do an atomic_read()of the argument > they get, they should just do "!lock.val.counter" > > So something like this should fix lockref. ENTIRELY UNTESTED, except > now the code generation of lockref_put_return() looks much better, > without a pointless flush to the stack, and now it has no pointless > stack frame as a result. > > Of course, it should probably be inlined, since it has only one user > (ok, two, since fast_dput() gets used twice), and that should make the > return value testing much better. > > Linus > include/asm-generic/qspinlock.h | 2 +- > include/asm-generic/spinlock.h | 17 +++++++++-------- > 2 files changed, 10 insertions(+), 9 deletions(-) > > diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h > index 995513fa2690..0655aa5b57b2 100644 > --- a/include/asm-generic/qspinlock.h > +++ b/include/asm-generic/qspinlock.h > @@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock) > */ > static __always_inline int queued_spin_value_unlocked(struct qspinlock lock) > { > - return !atomic_read(&lock.val); > + return !lock.val.counter; > } > > /** > diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h > index fdfebcb050f4..a35eda0ec2a2 100644 > --- a/include/asm-generic/spinlock.h > +++ b/include/asm-generic/spinlock.h > @@ -68,11 +68,17 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) > smp_store_release(ptr, (u16)val + 1); > } > > +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) > +{ > + u32 val = lock.counter; > + return ((val >> 16) == (val & 0xffff)); > +} > + > static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock) > { > - u32 val = atomic_read(lock); > - > - return ((val >> 16) != (val & 0xffff)); > + arch_spinlock_t val; > + val.counter = atomic_read(lock); > + return !arch_spin_value_unlocked(val); > } > > static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock) > @@ -82,11 +88,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock) > return (s16)((val >> 16) - (val & 0xffff)) > 1; > } > > -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) > -{ > - return !arch_spin_is_locked(&lock); > -} > - > #include <asm/qrwlock.h> > > #endif /* __ASM_GENERIC_SPINLOCK_H */
On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@kernel.org> wrote:
>
> We discussed x86 qspinlock code generation. It looked not too bad as I
> thought because qspinlock_spin_value_unlocked is much cheaper than the
> ticket-lock. But the riscv ticket-lock code generation is terrible
> because of the shift left & right 16-bit.
> https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com

No, it's not the 16-bit shifts in the spin_value_unlocked() check, that just generates simple and straightforward code:

    a0:   0107569b                srlw    a3,a4,0x10
    a4:   00c77733                and     a4,a4,a2
    a8:   04e69063                bne     a3,a4,e8 <.L12>

(plus two stupid instructions for generating the immediate in a2 for 0xffff, but hey, that's the usual insane RISC-V encoding thing - you can load a 20-bit U-immediate only shifted up by 12, if it's in the lower bits you're kind of screwed and limited to 12-bit immediates).

The *bad* code generation is from the much simpler

        new.count++;

which sadly neither gcc nor clang is quite smart enough to understand that "hey, I can do that in 64 bits".

It's incrementing the higher 32-bit word in a 64-bit union, and with a smarter compiler it *should* basically become

        lock_count += 1ul << 32;

but the compiler isn't that clever, so it splits the 64-bit word into two 32-bit words, increments one of them, and then merges the two words back into 64 bits:

    98:   4207d693                sra     a3,a5,0x20
    9c:   02079713                sll     a4,a5,0x20
    a0:   0016869b                addw    a3,a3,1
    a4:   02069693                sll     a3,a3,0x20
    a8:   02075713                srl     a4,a4,0x20
    ac:   00d76733                or      a4,a4,a3

which is pretty sad.

If you want to do the optimization that the compiler misses by hand, it would be something like the attached patch.

NOTE! Very untested. But that *should* cause the compiler to just generate a single "add" instruction (in addition to generating the constant 0x100000000, of course).

Of course, on a LL/SC architecture like RISC-V, in an *optimal* world, the whole sequence would actually be done with one single LL/SC, rather than the "load,add,cmpxchg" thing.

But then you'd have to do absolutely everything by hand in assembly.

                Linus
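The by-hand version being described boils down to something like the sketch below (it matches the LOCKREF_INC macro in the patch quoted later in the thread; the struct layout is the real one from include/linux/lockref.h, the macro here is just an illustration):

    #include <linux/lockref.h>

    /*
     * On 64-bit little-endian, the count is the high 32 bits of
     * lockref.lock_count, so "count++" can be done as a single
     * 64-bit add instead of a split/increment/merge sequence.
     */
    #if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
      #define LOCKREF_INC(n)    ((n).lock_count += 1ul << 32)
    #else
      #define LOCKREF_INC(n)    ((n).count++)
    #endif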
On Wed, 22 Nov 2023 at 09:20, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> If you want to do the optimization that the compiler misses by hand,
> it would be something like the attached patch.

Bah. Might as well do the reference decrements with the same logic, not just the increments.

Of course, this is much more noticeable with the ticket locks, because with the qspinlocks the "is this unlocked" test will check whether the lock is all zeroes.

So with qspinlocks, the compiler sees that "oh, the low 32 bits are zero", and the whole "merge the two words back to 64 bits" is much cheaper, and doesn't generate quite the mess that it does for RISC-V with ticket locks.

But this "treat the lockref as a 64-bit entity" thing is probably a good thing on most 64-bit architectures, including x86 that has that qspinlock thing.

Still not actually tested, but the code generation on x86 looks reasonable, so it might be worth looking at whether it helps the RISC-V case.

                Linus
On Wed, 22 Nov 2023 at 09:52, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Bah. Might as well do the reference decrements with the same logic,
> not just the increments.

And thanks for reminding me about this issue. I just committed the trivial one-liner fix for qspinlock code generation that apparently never went anywhere (mostly my own damn fault for not having pushed it enough and made a proper commit message).

                Linus
On Wed, 22 Nov 2023 at 09:52, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Still not actually tested, but the code generation on x86 looks
> reasonable, so it might be worth looking at whether it helps the
> RISC-V case.

Doing some more munging, and actually looking at RISC-V code generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for RISC-V).

I get this:

    lockref_get:
            addi    sp,sp,-32
            sd      s0,16(sp)
            sd      s1,8(sp)
            sd      ra,24(sp)
            addi    s0,sp,32
            li      a1,65536
            ld      a5,0(a0)
            mv      s1,a0
            addi    a1,a1,-1
            li      a0,100
    .L43:
            sext.w  a3,a5
            li      a4,1
            srliw   a2,a5,16
            and     a3,a3,a1
            slli    a4,a4,32
            bne     a2,a3,.L49
            add     a4,a5,a4
    0:
            lr.d    a3, 0(s1)
            bne     a3, a5, 1f
            sc.d.rl a2, a4, 0(s1)
            bnez    a2, 0b
            fence   rw, rw
    1:
            bne     a5,a3,.L52
            ld      ra,24(sp)
            ld      s0,16(sp)
            ld      s1,8(sp)
            addi    sp,sp,32
            jr      ra
            ...

so now that single update is indeed just one single instruction:

            add     a4,a5,a4

is that "increment count in the high 32 bits".

The ticket lock being unlocked checks are those

            li      a1,65536
            sext.w  a3,a5
            srliw   a2,a5,16
            and     a3,a3,a1
            bne     a2,a3,.L49

instructions if I read it right.

That actually looks fairly close to optimal, although the frame setup is kind of sad.

(The above does not include the "loop if the cmpxchg failed" part of the code generation)

Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V is attached.

I'm not going to play with this any more, but you might want to check whether this actually does work on RISC-V.

Because I only looked at the code generation, I didn't actually look at whether it *worked*.

                Linus
On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote: > On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@kernel.org> wrote: > > > > We discussed x86 qspinlock code generation. It looked not too bad as I > > thought because qspinlock_spin_value_unlocked is much cheaper than the > > ticket-lock. But the riscv ticket-lock code generation is terrible > > because of the shift left & right 16-bit. > > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com > > No, it's not the 16-bit shifts in the spin_value_unlocked() check, > that just generates simple and straightforward code: > > a0: 0107569b srlw a3,a4,0x10 > a4: 00c77733 and a4,a4,a2 > a8: 04e69063 bne a3,a4,e8 <.L12> > > (plus two stupid instructions for generating the immediate in a2 for > 0xffff, but hey, that's the usual insane RISC-V encoding thing - you > can load a 20-bit U-immediate only shifted up by 12, if it's in the > lower bits you're kind of screwed and limited to 12-bit immediates). > > The *bad* code generation is from the much simpler > > new.count++; > > which sadly neither gcc not clang is quite smart enough to understand > that "hey, I can do that in 64 bits". > > It's incrementing the higher 32-bit word in a 64-bit union, and with a > smarter compiler it *should* basically become > > lock_count += 1 << 32; > > but the compiler isn't that clever, so it splits the 64-bit word into > two 32-bit words, increments one of them, and then merges the two > words back into 64 bits: > > 98: 4207d693 sra a3,a5,0x20 > 9c: 02079713 sll a4,a5,0x20 > a0: 0016869b addw a3,a3,1 > a4: 02069693 sll a3,a3,0x20 > a8: 02075713 srl a4,a4,0x20 > ac: 00d76733 or a4,a4,a3 > > which is pretty sad. 9c & a8 is for word-zero-extend; riscv would have zext.w in the future. Your patch may improve above with: li a4,1 slli a4,a4,32 add a4,a5,a4 v.s. sra a3,a5,0x20 zext.w a4,a5 addw a3,a3,1 or a4,a4,a3 You win one instruction "or a4,a4,a3", which is less than one cycle. The zext.w is important, and it could replace sll+srl a lot, so I think it's a current ISA design short. Here, what I want to improve is to prevent stack frame setup in the fast path, and that's the most benefit my patch could give out. Unnecessary memory access is the most important performance killer in SMP. My patch removes the stack frame setup from the fast path. 
void lockref_get(struct lockref *lockref) { 78: 00053783 ld a5,0(a0) 000000000000007c <.LBB212>: 7c: 00010637 lui a2,0x10 0000000000000080 <.LBE212>: 80: 06400593 li a1,100 0000000000000084 <.LBB216>: 84: fff60613 add a2,a2,-1 # ffff <.LLST8+0xf4aa> 0000000000000088 <.L8>: 88: 0007871b sext.w a4,a5 000000000000008c <.LBB217>: 8c: 0107d69b srlw a3,a5,0x10 90: 00c77733 and a4,a4,a2 94: 04e69063 bne a3,a4,d4 <.L12> --------+ | 0000000000000098 <.LBB218>: | 98: 4207d693 sra a3,a5,0x20 | 9c: 02079713 sll a4,a5,0x20 | a0: 0016869b addw a3,a3,1 | a4: 02069693 sll a3,a3,0x20 | a8: 02075713 srl a4,a4,0x20 | ac: 00d76733 or a4,a4,a3 | | 00000000000000b0 <.L0^B1>: | b0: 100536af lr.d a3,(a0) | b4: 00f69863 bne a3,a5,c4 <.L1^B1> | b8: 1ae5382f sc.d.rl a6,a4,(a0) | bc: fe081ae3 bnez a6,b0 <.L0^B1> | c0: 0330000f fence rw,rw | | 00000000000000c4 <.L1^B1>: | c4: 04d78a63 beq a5,a3,118 <.L18> | | 00000000000000c8 <.LBE228>: | c8: fff5859b addw a1,a1,-1 | | 00000000000000cc <.LBB229>: | cc: 00068793 mv a5,a3 | | 00000000000000d0 <.LBE229>: | d0: fa059ce3 bnez a1,88 <.L8> | | 00000000000000d4 <.L12>: <--------------------------------------+ { slow_path d4: fe010113 add sp,sp,-32 d8: 00113c23 sd ra,24(sp) dc: 00813823 sd s0,16(sp) e0: 02010413 add s0,sp,32 > > If you want to do the optimization that the compiler misses by hand, > it would be something like the attached patch. > > NOTE! Very untested. But that *should* cause the compiler to just > generate a single "add" instruction (in addition to generating the > constant 0x100000000, of course). > > Of course, on a LL/SC architecture like RISC-V, in an *optimal* world, > the whole sequence would actually be done with one single LL/SC, > rather than the "load,add,cmpxchg" thing. > > But then you'd have to do absolutely everything by hand in assembly. No, it's not worth to do that. - There are only atomic primitives in Linux, but no ll/sc primitive in the real world. The world belongs to AMO and the only usage of ll/sc is to implement AMO and CAS. - Using single ll/sc primitive instead of cmpxchg is similar to your patch, and you may win 1 cycle or not. - The critical work here are reducing bus transactions, preventing cache dance, and forward progress guarantee. Here is my optimization advice: #define CMPXCHG_LOOP(CODE, SUCCESS) do { \ int retry = 100; \ struct lockref old; \ BUILD_BUG_ON(sizeof(old) != 8); \ + prefetchw(lockref); \ old.lock_count = READ_ONCE(lockref->lock_count); \ while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ struct lockref new = old; \ CODE \ if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ &old.lock_count, \ new.lock_count))) { \ SUCCESS; \ } \ Micro-arch could give prefetchw more guarantee: - Prefetch.w must guarantee cache line exclusiveness even when a shareable state cache line hits. - Hold the exclusive cache line for several cycles until the next store or timeout - Mask interrupt during the holding cycles (Optional) The lockref slow path is killed in this micro-architecture, which means there is no chance to execute the spinlock. I've written down more details in my ppt: https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing This type of prefetchw could help large-size atomic operations within one cache line. Compared to the transaction memory model, prefetchw could give a forward progress guarantee and easier landing in Linux without any new primitive. 
> > Linus > lib/lockref.c | 17 ++++++++++++++--- > 1 file changed, 14 insertions(+), 3 deletions(-) > > diff --git a/lib/lockref.c b/lib/lockref.c > index 2afe4c5d8919..481b102a6476 100644 > --- a/lib/lockref.c > +++ b/lib/lockref.c > @@ -26,6 +26,17 @@ > } \ > } while (0) > > +/* > + * The compiler isn't smart enough to the the count > + * increment in the high 32 bits of the 64-bit value, > + * so do this optimization by hand. > + */ > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64 > + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32) > +#else > + #define LOCKREF_INC(n) ((n).count++) > +#endif > + > #else > > #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0) > @@ -42,7 +53,7 @@ > void lockref_get(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count++; > + LOCKREF_INC(new); > , > return; > ); > @@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > + LOCKREF_INC(new); > if (old.count <= 0) > return 0; > , > @@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > + LOCKREF_INC(new); > if (old.count < 0) > return 0; > ,
On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
>
> Here is my optimization advice:
>
> #define CMPXCHG_LOOP(CODE, SUCCESS) do { \
>         int retry = 100; \
>         struct lockref old; \
>         BUILD_BUG_ON(sizeof(old) != 8); \
> +       prefetchw(lockref); \

No. We're not adding software prefetches to generic code. Been there, done that. They *never* improve performance on good hardware. They end up helping on some random (usually particularly bad) microarchitecture, and then they hurt everybody else.

And the real optimization advice is: "don't run on crap hardware".

It really is that simple. Good hardware does OoO and sees the future write.

> Micro-arch could give prefetchw more guarantee:

Well, in practice, they never do, and in fact they are often buggy and cause problems because they weren't actually tested very much.

                Linus
On Sun, Nov 26, 2023 at 11:39:37AM -0500, Guo Ren wrote: > On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote: > > On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@kernel.org> wrote: > > > > > > We discussed x86 qspinlock code generation. It looked not too bad as I > > > thought because qspinlock_spin_value_unlocked is much cheaper than the > > > ticket-lock. But the riscv ticket-lock code generation is terrible > > > because of the shift left & right 16-bit. > > > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com > > > > No, it's not the 16-bit shifts in the spin_value_unlocked() check, > > that just generates simple and straightforward code: > > > > a0: 0107569b srlw a3,a4,0x10 > > a4: 00c77733 and a4,a4,a2 > > a8: 04e69063 bne a3,a4,e8 <.L12> > > > > (plus two stupid instructions for generating the immediate in a2 for > > 0xffff, but hey, that's the usual insane RISC-V encoding thing - you > > can load a 20-bit U-immediate only shifted up by 12, if it's in the > > lower bits you're kind of screwed and limited to 12-bit immediates). > > > > The *bad* code generation is from the much simpler > > > > new.count++; > > > > which sadly neither gcc not clang is quite smart enough to understand > > that "hey, I can do that in 64 bits". > > > > It's incrementing the higher 32-bit word in a 64-bit union, and with a > > smarter compiler it *should* basically become > > > > lock_count += 1 << 32; > > > > but the compiler isn't that clever, so it splits the 64-bit word into > > two 32-bit words, increments one of them, and then merges the two > > words back into 64 bits: > > > > 98: 4207d693 sra a3,a5,0x20 > > 9c: 02079713 sll a4,a5,0x20 > > a0: 0016869b addw a3,a3,1 > > a4: 02069693 sll a3,a3,0x20 > > a8: 02075713 srl a4,a4,0x20 > > ac: 00d76733 or a4,a4,a3 > > > > which is pretty sad. > 9c & a8 is for word-zero-extend; riscv would have zext.w in the future. > Your patch may improve above with: > li a4,1 > slli a4,a4,32 > add a4,a5,a4 > > v.s. > sra a3,a5,0x20 > zext.w a4,a5 > addw a3,a3,1 > or a4,a4,a3 > You win one instruction "or a4,a4,a3", which is less than one cycle. Sorry, I forgot "sll a3,a3,0x20", so it's 1.5 cycles, but it didn't affect my opinion here; local core operations are the lower optimization priority than memory transations. > > The zext.w is important, and it could replace sll+srl a lot, so I think > it's a current ISA design short. > > Here, what I want to improve is to prevent stack frame setup in the fast > path, and that's the most benefit my patch could give out. Unnecessary > memory access is the most important performance killer in SMP. > > My patch removes the stack frame setup from the fast path. 
> void lockref_get(struct lockref *lockref) > { > 78: 00053783 ld a5,0(a0) > 000000000000007c <.LBB212>: > 7c: 00010637 lui a2,0x10 > > 0000000000000080 <.LBE212>: > 80: 06400593 li a1,100 > > 0000000000000084 <.LBB216>: > 84: fff60613 add a2,a2,-1 # ffff <.LLST8+0xf4aa> > > 0000000000000088 <.L8>: > 88: 0007871b sext.w a4,a5 > > 000000000000008c <.LBB217>: > 8c: 0107d69b srlw a3,a5,0x10 > 90: 00c77733 and a4,a4,a2 > 94: 04e69063 bne a3,a4,d4 <.L12> --------+ > | > 0000000000000098 <.LBB218>: | > 98: 4207d693 sra a3,a5,0x20 | > 9c: 02079713 sll a4,a5,0x20 | > a0: 0016869b addw a3,a3,1 | > a4: 02069693 sll a3,a3,0x20 | > a8: 02075713 srl a4,a4,0x20 | > ac: 00d76733 or a4,a4,a3 | > | > 00000000000000b0 <.L0^B1>: | > b0: 100536af lr.d a3,(a0) | > b4: 00f69863 bne a3,a5,c4 <.L1^B1> | > b8: 1ae5382f sc.d.rl a6,a4,(a0) | > bc: fe081ae3 bnez a6,b0 <.L0^B1> | > c0: 0330000f fence rw,rw | > | > 00000000000000c4 <.L1^B1>: | > c4: 04d78a63 beq a5,a3,118 <.L18> | > | > 00000000000000c8 <.LBE228>: | > c8: fff5859b addw a1,a1,-1 | > | > 00000000000000cc <.LBB229>: | > cc: 00068793 mv a5,a3 | > | > 00000000000000d0 <.LBE229>: | > d0: fa059ce3 bnez a1,88 <.L8> | > | > 00000000000000d4 <.L12>: <--------------------------------------+ > { slow_path > d4: fe010113 add sp,sp,-32 > d8: 00113c23 sd ra,24(sp) > dc: 00813823 sd s0,16(sp) > e0: 02010413 add s0,sp,32 > > > > > > If you want to do the optimization that the compiler misses by hand, > > it would be something like the attached patch. > > > > NOTE! Very untested. But that *should* cause the compiler to just > > generate a single "add" instruction (in addition to generating the > > constant 0x100000000, of course). > > > > Of course, on a LL/SC architecture like RISC-V, in an *optimal* world, > > the whole sequence would actually be done with one single LL/SC, > > rather than the "load,add,cmpxchg" thing. > > > > But then you'd have to do absolutely everything by hand in assembly. > No, it's not worth to do that. > - There are only atomic primitives in Linux, but no ll/sc primitive in > the real world. The world belongs to AMO and the only usage of ll/sc > is to implement AMO and CAS. > - Using single ll/sc primitive instead of cmpxchg is similar to your > patch, and you may win 1 cycle or not. > - The critical work here are reducing bus transactions, preventing > cache dance, and forward progress guarantee. > > Here is my optimization advice: > > #define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > int retry = 100; \ > struct lockref old; \ > BUILD_BUG_ON(sizeof(old) != 8); \ > + prefetchw(lockref); \ > old.lock_count = READ_ONCE(lockref->lock_count); \ > while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > struct lockref new = old; \ > CODE \ > if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > &old.lock_count, \ > new.lock_count))) { \ > SUCCESS; \ > } \ > > Micro-arch could give prefetchw more guarantee: > - Prefetch.w must guarantee cache line exclusiveness even when a > shareable state cache line hits. > - Hold the exclusive cache line for several cycles until the next > store or timeout > - Mask interrupt during the holding cycles (Optional) > > The lockref slow path is killed in this micro-architecture, which > means there is no chance to execute the spinlock. > > I've written down more details in my ppt: > https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing > > This type of prefetchw could help large-size atomic operations within > one cache line. 
Compared to the transaction memory model, prefetchw > could give a forward progress guarantee and easier landing in Linux > without any new primitive. > > > > > Linus > > > lib/lockref.c | 17 ++++++++++++++--- > > 1 file changed, 14 insertions(+), 3 deletions(-) > > > > diff --git a/lib/lockref.c b/lib/lockref.c > > index 2afe4c5d8919..481b102a6476 100644 > > --- a/lib/lockref.c > > +++ b/lib/lockref.c > > @@ -26,6 +26,17 @@ > > } \ > > } while (0) > > > > +/* > > + * The compiler isn't smart enough to the the count > > + * increment in the high 32 bits of the 64-bit value, > > + * so do this optimization by hand. > > + */ > > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64 > > + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32) > > +#else > > + #define LOCKREF_INC(n) ((n).count++) > > +#endif > > + > > #else > > > > #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0) > > @@ -42,7 +53,7 @@ > > void lockref_get(struct lockref *lockref) > > { > > CMPXCHG_LOOP( > > - new.count++; > > + LOCKREF_INC(new); > > , > > return; > > ); > > @@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref) > > int retval; > > > > CMPXCHG_LOOP( > > - new.count++; > > + LOCKREF_INC(new); > > if (old.count <= 0) > > return 0; > > , > > @@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref) > > int retval; > > > > CMPXCHG_LOOP( > > - new.count++; > > + LOCKREF_INC(new); > > if (old.count < 0) > > return 0; > > , > >
On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
>
> Here, what I want to improve is to prevent stack frame setup in the fast
> path, and that's the most benefit my patch could give out.

Side note: what patch do you have that avoids the stack frame setup? Because I still saw the stack frame even with the arch_spin_value_unlocked() fix and the improved code generation. The compiler still does

            addi    sp,sp,-32
            sd      s0,16(sp)
            sd      s1,8(sp)
            sd      ra,24(sp)
            addi    s0,sp,32

at the top of the function for me - not because of the (now fixed) lock value spilling, but just because it wants to save registers.

The reason seems to be that gcc isn't smart enough to delay the frame setup to the slow path where it then has to do the actual spinlock, so it has to generate a stack frame just for the return address and then it does the whole frame setup thing.

I was using just the risc-v defconfig (with the cmpxchg lockrefs enabled, and spinlock debugging disabled so that lockrefs actually do something), so there might be some other config thing like "force frame pointers" that then causes problems.

But while the current tree avoids the silly lock value spill and reload, and my patch improved the integer instruction selection, I really couldn't get rid of the stack frame entirely. The x86 code also ends up looking quite nice, although part of that is that the qspinlock test is a simple compare against zero:

    lockref_get:
            pushq   %rbx
            movq    %rdi, %rbx
            movq    (%rdi), %rax
            movl    $-100, %ecx
            movabsq $4294967296, %rdx
    .LBB0_1:
            testl   %eax, %eax
            jne     .LBB0_4
            leaq    (%rax,%rdx), %rsi
            lock
            cmpxchgq %rsi, (%rbx)
            je      .LBB0_5
            incl    %ecx
            jne     .LBB0_1
    .LBB0_4:
            movq    %rbx, %rdi
            callq   _raw_spin_lock
            incl    4(%rbx)
            movb    $0, (%rbx)
    .LBB0_5:
            popq    %rbx
            retq

(That 'movabsq' thing is what generates the big constant that adds '1' in the upper word - that add is then done as a 'leaq').

In this case, the 'retry' count is actually a noticeable part of the code generation, and is probably also why it has to save/restore '%rbx'. Oh well. We limited the cmpxchg loop because of horrible issues with starvation on bad arm64 cores. It turns out that SMP cacheline bouncing is hard, and if you haven't been doing it for a couple of decades, you'll do it wrong.

You'll find out the hard way that the same is probably true on any early RISC-V SMP setups. You wanting to use prefetchw is a pretty clear indication of the same kind of thing.

                Linus
On Sun, 26 Nov 2023 at 09:06, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> In this case, the 'retry' count is actually a noticeable part of the
> code generation, and is probably also why it has to save/restore
> '%rbx'.

Nope. The reason for having to save/restore a register is the

        spin_lock(&lockref->lock);
        lockref->count++;

sequence: since spin_lock() is a function call, it will clobber all the registers that a function can clobber, and the callee has to keep the 'lockref' argument somewhere. So it needs a callee-saved register, which it then itself needs to save.

Inlining the spinlock sequence entirely would fix it, but is the wrong thing to do for the slow case.

Marking the spinlock functions with

        __attribute__((no_caller_saved_registers))

might actually be a reasonable option. It makes the spinlock itself more expensive (since now it saves/restores all the registers it uses), but in this case that's the right thing to do.

Of course, in this case, lockref has already done the optimistic "check the lock" version, so our spinlock code that does that

        LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);

which first tries to do the trylock, is all kinds of wrong.

In a perfect world, the lockref code actually wants only the slow-path, since it has already done the fast-path case. And it would have that "slow path saves all registers" thing.

That might be a good idea for spinlocks in general, who knows..

Oh well. Probably not worth worrying about. In my profiles, lockref looks pretty good even under heavy dentry load. Even if it's not perfect.

                Linus
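To make the attribute idea concrete, here is a standalone (non-kernel) sketch. The function and type names are made up for illustration; the attribute itself is the real x86-only GCC/Clang extension mentioned above, which makes the marked function preserve every register it touches so that callers do not have to treat registers as clobbered across the call:

    /* Illustration only - hypothetical names, x86 GCC/Clang extension. */
    struct toy_lockref {
            int lock;
            int count;
    };

    __attribute__((noinline, no_caller_saved_registers))
    static void toy_lockref_slow(struct toy_lockref *ref)
    {
            /* pretend slow path: spin for the lock, bump the count, unlock */
            while (__sync_lock_test_and_set(&ref->lock, 1))
                    ;
            ref->count++;
            __sync_lock_release(&ref->lock);
    }

    void toy_lockref_get(struct toy_lockref *ref)
    {
            /* fast path elided; on failure, fall back to the slow path.
             * Because of the attribute, the compiler may keep 'ref' in a
             * caller-saved register here without setting up a stack frame. */
            toy_lockref_slow(ref);
    }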
On Wed, Nov 22, 2023 at 11:11:38AM -0800, Linus Torvalds wrote: > On Wed, 22 Nov 2023 at 09:52, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Still not actually tested, but the code generation on x86 looks > > reasonable, so it migth be worth looking at whether it helps the > > RISC-V case. > > Doing some more munging, and actually looking at RISC-V code > generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for > RISC-V). > > I get this: > > lockref_get: > addi sp,sp,-32 > sd s0,16(sp) > sd s1,8(sp) > sd ra,24(sp) > addi s0,sp,32 > li a1,65536 > ld a5,0(a0) > mv s1,a0 > addi a1,a1,-1 > li a0,100 > .L43: > sext.w a3,a5 > li a4,1 > srliw a2,a5,16 > and a3,a3,a1 > slli a4,a4,32 > bne a2,a3,.L49 > add a4,a5,a4 > 0: > lr.d a3, 0(s1) > bne a3, a5, 1f > sc.d.rl a2, a4, 0(s1) > bnez a2, 0b > fence rw, rw > 1: > bne a5,a3,.L52 > ld ra,24(sp) > ld s0,16(sp) > ld s1,8(sp) > addi sp,sp,32 > jr ra > ... > > so now that single update is indeed just one single instruction: > > add a4,a5,a4 > > is that "increment count in the high 32 bits". > > The ticket lock being unlocked checks are those > > li a1,65536 > sext.w a3,a5 > srliw a2,a5,16 > and a3,a3,a1 > bne a2,a3,.L49 > > instructions if I read it right. > > That actually looks fairly close to optimal, although the frame setup > is kind of sad. > > (The above does not include the "loop if the cmpxchg failed" part of > the code generation) > > Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V > is attached. > > I'm not going to play with this any more, but you might want to check > whether this actually does work on RISC-V. > > Becaue I only looked at the code generation, I didn't actually look at > whether it *worked*. > > Linus > From 168f35850c15468941e597907e33daacd179d54a Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Wed, 22 Nov 2023 09:33:29 -0800 > Subject: [PATCH] lockref: improve code generation for ref updates > > Our lockref data structure is two 32-bit words laid out next to each > other, combining the spinlock and the count into one entity that can be > accessed atomically together. > > In particular, the structure is laid out so that the count is the upper > 32 bit word (on little-endian), so that you can do basic arithmetic on > the count in 64 bits: instead of adding one to the 32-bit word, you can > just add a value shifted by 32 to the full 64-bit word. > > Sadly, neither gcc nor clang are quite clever enough to work that out on > their own, so this does that "manually". > > Also, try to do any compares against zero values, which generally > improves the code generation. So rather than check that the value was > at least 1 before a decrement, check that it's positive or zero after > the decrement. We don't worry about the overflow point in lockrefs. Tested-by: Guo Ren <guoren@kernel.org> This patch could help riscv optimize 3 ALU instructions. 
Before the patch: 000000000000020c <lockref_get>: CMPXCHG_LOOP( 20c: 611c ld a5,0(a0) 000000000000020e <.LBB492>: 20e: 03079713 sll a4,a5,0x30 212: 0107d69b srlw a3,a5,0x10 216: 9341 srl a4,a4,0x30 218: 02e69663 bne a3,a4,244 <.L40> 000000000000021c <.LBB494>: 21c: 4207d693 sra a3,a5,0x20 -------+ 220: 02079713 sll a4,a5,0x20 | 224: 2685 addw a3,a3,1 | 226: 1682 sll a3,a3,0x20 | 228: 9301 srl a4,a4,0x20 | 22a: 8f55 or a4,a4,a3 -------+ 000000000000022c <.L0^B4>: 22c: 100536af lr.d a3,(a0) 230: 00f69763 bne a3,a5,23e <.L1^B5> 234: 1ae5362f sc.d.rl a2,a4,(a0) 238: fa75 bnez a2,22c <.L0^B4> 23a: 0330000f fence rw,rw After the patch: 000000000000020c <lockref_get>: CMPXCHG_LOOP( 20c: 611c ld a5,0(a0) 000000000000020e <.LBB526>: 20e: 03079713 sll a4,a5,0x30 212: 0107d69b srlw a3,a5,0x10 216: 9341 srl a4,a4,0x30 218: 02e69163 bne a3,a4,23a <.L40> 000000000000021c <.LBB528>: 21c: 4705 li a4,1 ------+ 21e: 1702 sll a4,a4,0x20 | 220: 973e add a4,a4,a5 ------+ 0000000000000222 <.L0^B4>: 222: 100536af lr.d a3,(a0) 226: 00f69763 bne a3,a5,234 <.L1^B5> 22a: 1ae5362f sc.d.rl a2,a4,(a0) 22e: fa75 bnez a2,222 <.L0^B4> 230: 0330000f fence rw,rw > > Cc: Guo Ren <guoren@kernel.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > lib/lockref.c | 29 ++++++++++++++++++++--------- > 1 file changed, 20 insertions(+), 9 deletions(-) > > diff --git a/lib/lockref.c b/lib/lockref.c > index 2afe4c5d8919..f3c30c538af1 100644 > --- a/lib/lockref.c > +++ b/lib/lockref.c > @@ -26,6 +26,17 @@ > } \ > } while (0) > > +/* > + * The compiler isn't smart enough to the the count > + * increment in the high 32 bits of the 64-bit value, > + * so do this optimization by hand. > + */ > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64 > + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32) > +#else > + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32) > +#endif > + > #else > > #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0) > @@ -42,7 +53,7 @@ > void lockref_get(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count++; > + LOCKREF_ADD(new,1); > , > return; > ); > @@ -63,9 +74,9 @@ int lockref_get_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > if (old.count <= 0) > return 0; > + LOCKREF_ADD(new,1); > , > return 1; > ); > @@ -91,8 +102,8 @@ int lockref_put_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 1) > + LOCKREF_ADD(new,-1); > + if (new.count <= 0) > return 0; > , > return 1; > @@ -119,8 +130,8 @@ EXPORT_SYMBOL(lockref_put_not_zero); > int lockref_put_return(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 0) > + LOCKREF_ADD(new,-1); > + if (new.count < 0) > return -1; > , > return new.count; > @@ -137,8 +148,8 @@ EXPORT_SYMBOL(lockref_put_return); > int lockref_put_or_lock(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 1) > + LOCKREF_ADD(new,-1); > + if (new.count <= 0) > break; > , > return 1; > @@ -174,9 +185,9 @@ int lockref_get_not_dead(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > if (old.count < 0) > return 0; > + LOCKREF_ADD(new,1); > , > return 1; > ); > -- > 2.43.0.5.g38fb137bdb >
On Sun, Nov 26, 2023 at 09:06:03AM -0800, Linus Torvalds wrote: > On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote: > > > > Here, what I want to improve is to prevent stack frame setup in the fast > > path, and that's the most benefit my patch could give out. > > Side note: what patch do you have that avoids the stack frame setup? > Because I still saw the stack frame even with the > arch_spin_value_unlocked() fix and the improved code generation. The > compiler still does > > addi sp,sp,-32 > sd s0,16(sp) > sd s1,8(sp) > sd ra,24(sp) > addi s0,sp,32 I found below affect you: #define CMPXCHG_LOOP(CODE, SUCCESS) do { \ - int retry = 100; \ struct lockref old; \ BUILD_BUG_ON(sizeof(old) != 8); \ old.lock_count = READ_ONCE(lockref->lock_count); \ @@ -21,11 +20,21 @@ new.lock_count))) { \ SUCCESS; \ } \ - if (!--retry) \ - break; \ Yes, The 'retry' count patch [1] hurts us. [1]: https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m091df9dca68c27c28f8f69a72cae0e1361dba4fa > > at the top of the function for me - not because of the (now fixed) > lock value spilling, but just because it wants to save registers. > > The reason seems to be that gcc isn't smart enough to delay the frame > setup to the slow path where it then has to do the actual spinlock, so > it has to generate a stack frame just for the return address and then > it does the whole frame setup thing. > > I was using just the risc-v defconfig (with the cmpxchg lockrefs > enabled, and spinlock debugging disabled so that lockrefs actually do > something), so there might be some other config thing like "force > frame pointers" that then causes problems. > > But while the current tree avoids the silly lock value spill and > reload, and my patch improved the integer instruction selection, I > really couldn't get rid of the stack frame entirely. The x86 code also > ends up looking quite nice, although part of that is that the > qspinlock test is a simple compare against zero: > > lockref_get: > pushq %rbx > movq %rdi, %rbx > movq (%rdi), %rax > movl $-100, %ecx > movabsq $4294967296, %rdx > .LBB0_1: > testl %eax, %eax > jne .LBB0_4 > leaq (%rax,%rdx), %rsi > lock > cmpxchgq %rsi, (%rbx) > je .LBB0_5 > incl %ecx > jne .LBB0_1 > .LBB0_4: > movq %rbx, %rdi > callq _raw_spin_lock > incl 4(%rbx) > movb $0, (%rbx) > .LBB0_5: > popq %rbx > retq > > (That 'movabsq' thing is what generates the big constant that adds '1' > in the upper word - that add is then done as a 'leaq'). > > In this case, the 'retry' count is actually a noticeable part of the > code generation, and is probably also why it has to save/restore > '%rbx'. Oh well. We limited the cmpxchg loop because of horrible > issues with starvation on bad arm64 cores. It turns out that SMP > cacheline bouncing is hard, and if you haven't been doing it for a > couple of decades, you'll do it wrong. > > You'll find out the hard way that the same is probably true on any > early RISC-V SMP setups. You wanting to use prefetchw is a pretty > clear indication of the same kind of thing. The 'retry' count is bad solution, which hides the problem. ThunderX2's problem is mainly about unnecessary cpu_relax & cacheline sticky less. In the AMBA 5 CHI spec "Home behavior" section says: [2] "When a Home(CIU/LLcache) determines that an Exclusive Store transaction has failed, the following rules must be followed: If the Requester has lost the cache line, then the Home is expected to send SnpPreferUniqueFwd or SnpPreferUnique to get a copy of the cache line." 
The SnpPreferUnique is not SnpUnique, which means it would return a shared cacheline in case of serious contention. No guarantee for the next cmpxchg. But, we want a unique cache line right? You said: [1] "... And once one CPU gets ownership of the line, it doesn't lose it immediately, so the next cmpxchg will *succeed*. So at most, the *first* cmpxchg will fail (because that's the one that was fed not by a previous cmpxchg, but by a regular load (which we'd *like* to do as a "load-for-ownership" load, but we don't have the interfaces to do that). But the second cmpxchg should basically always succeed, ..." (Sorry, I quoted you like this.) My argue is: Why do we need to wait for cmpxchg failure? You have the "load-for-ownership" interface: "prefetchw"! lockref_get: pushq %rbx +------prefetchw (%rdi) --------> doesn't lose it immediately, st| so the next cmpxchg will *succeed* ic| - Linus ky| movq %rdi, %rbx t| movq (%rdi), %rax ------> local acquire, comfortable! im| movl $-100, %ecx e | movabsq $4294967296, %rdx |.LBB0_1: | testl %eax, %eax | jne .LBB0_4 | leaq (%rax,%rdx), %rsi | lock | cmpxchgq %rsi, (%rbx) --> local cas, success! +----- je .LBB0_5 ------> Farewell to the slowpath! If x86 is not a crap machine, "movq+movq+movl+movabsq+testl+jne+leak+ cmpxchg" should be fast enough to satisfy the sticky time. The prefetchw primitive has been defined in include/linux/prefetch.h for many years. The prefetchw has been used for generic code: ➜ linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r mm/slub.c: prefetchw(object + s->offset); mm/slab.c: prefetchw(objp); mm/page_alloc.c: prefetchw(p); mm/page_alloc.c: prefetchw(p + 1); mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) \ mm/vmscan.c: prefetchw(&prev->_field); \ mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0) mm/vmscan.c: prefetchw_prev_lru_folio(folio, src, flags); fs/mpage.c: prefetchw(&folio->flags); fs/f2fs/data.c: prefetchw(&page->flags); fs/ext4/readpage.c: prefetchw(&folio->flags); kernel/bpf/cpumap.c: prefetchw(page); kernel/locking/qspinlock.c: prefetchw(next); ➜ linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l 80 The prefetchw is okay for all good hardware. Not like the 'retry' one. [1]: https://lore.kernel.org/lkml/CAHk-=wiEahkwDXpoy=-SzJHNMRXKVSjPa870+eKKenufhO_Hgw@mail.gmail.com/raw [2]: https://kolegite.com/EE_library/datasheets_and_manuals/FPGA/AMBA/IHI0050E_a_amba_5_chi_architecture_spec.pdf > > Linus >
On Wed, Nov 22, 2023 at 11:11:38AM -0800, Linus Torvalds wrote: > On Wed, 22 Nov 2023 at 09:52, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Still not actually tested, but the code generation on x86 looks > > reasonable, so it migth be worth looking at whether it helps the > > RISC-V case. > > Doing some more munging, and actually looking at RISC-V code > generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for > RISC-V). > > I get this: > > lockref_get: > addi sp,sp,-32 > sd s0,16(sp) > sd s1,8(sp) > sd ra,24(sp) > addi s0,sp,32 > li a1,65536 > ld a5,0(a0) > mv s1,a0 > addi a1,a1,-1 > li a0,100 > .L43: > sext.w a3,a5 > li a4,1 > srliw a2,a5,16 > and a3,a3,a1 > slli a4,a4,32 > bne a2,a3,.L49 > add a4,a5,a4 > 0: > lr.d a3, 0(s1) > bne a3, a5, 1f > sc.d.rl a2, a4, 0(s1) > bnez a2, 0b > fence rw, rw > 1: > bne a5,a3,.L52 > ld ra,24(sp) > ld s0,16(sp) > ld s1,8(sp) > addi sp,sp,32 > jr ra > ... > > so now that single update is indeed just one single instruction: > > add a4,a5,a4 > > is that "increment count in the high 32 bits". > > The ticket lock being unlocked checks are those > > li a1,65536 > sext.w a3,a5 > srliw a2,a5,16 > and a3,a3,a1 > bne a2,a3,.L49 > > instructions if I read it right. > > That actually looks fairly close to optimal, although the frame setup > is kind of sad. > > (The above does not include the "loop if the cmpxchg failed" part of > the code generation) > > Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V > is attached. > > I'm not going to play with this any more, but you might want to check > whether this actually does work on RISC-V. > > Becaue I only looked at the code generation, I didn't actually look at > whether it *worked*. > > Linus > From 168f35850c15468941e597907e33daacd179d54a Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Wed, 22 Nov 2023 09:33:29 -0800 > Subject: [PATCH] lockref: improve code generation for ref updates > > Our lockref data structure is two 32-bit words laid out next to each > other, combining the spinlock and the count into one entity that can be > accessed atomically together. > > In particular, the structure is laid out so that the count is the upper > 32 bit word (on little-endian), so that you can do basic arithmetic on > the count in 64 bits: instead of adding one to the 32-bit word, you can > just add a value shifted by 32 to the full 64-bit word. > > Sadly, neither gcc nor clang are quite clever enough to work that out on > their own, so this does that "manually". > > Also, try to do any compares against zero values, which generally > improves the code generation. So rather than check that the value was > at least 1 before a decrement, check that it's positive or zero after > the decrement. We don't worry about the overflow point in lockrefs. > > Cc: Guo Ren <guoren@kernel.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > lib/lockref.c | 29 ++++++++++++++++++++--------- > 1 file changed, 20 insertions(+), 9 deletions(-) > > diff --git a/lib/lockref.c b/lib/lockref.c > index 2afe4c5d8919..f3c30c538af1 100644 > --- a/lib/lockref.c > +++ b/lib/lockref.c > @@ -26,6 +26,17 @@ > } \ > } while (0) > > +/* > + * The compiler isn't smart enough to the the count > + * increment in the high 32 bits of the 64-bit value, > + * so do this optimization by hand. 
> + */ > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64 > + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32) > +#else > + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32) #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)) ? > +#endif > + > #else > > #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0) > @@ -42,7 +53,7 @@ > void lockref_get(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count++; > + LOCKREF_ADD(new,1); > , > return; > ); > @@ -63,9 +74,9 @@ int lockref_get_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > if (old.count <= 0) > return 0; > + LOCKREF_ADD(new,1); > , > return 1; > ); > @@ -91,8 +102,8 @@ int lockref_put_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 1) > + LOCKREF_ADD(new,-1); > + if (new.count <= 0) > return 0; > , > return 1; > @@ -119,8 +130,8 @@ EXPORT_SYMBOL(lockref_put_not_zero); > int lockref_put_return(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 0) > + LOCKREF_ADD(new,-1); > + if (new.count < 0) > return -1; > , > return new.count; > @@ -137,8 +148,8 @@ EXPORT_SYMBOL(lockref_put_return); > int lockref_put_or_lock(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 1) > + LOCKREF_ADD(new,-1); > + if (new.count <= 0) > break; > , > return 1; > @@ -174,9 +185,9 @@ int lockref_get_not_dead(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > if (old.count < 0) > return 0; > + LOCKREF_ADD(new,1); > , > return 1; > ); > -- > 2.43.0.5.g38fb137bdb >
On Wed, 29 Nov 2023 at 04:25, Guo Ren <guoren@kernel.org> wrote: > > > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64 > > + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32) > > +#else > > + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32) > #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)) > ? Yes. I obviously only tested the little-endian case, and the BE case was a bit too much cut-and-paste.. Linus
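To make the layout trick in the patch above concrete, here is a
stand-alone user-space illustration (toy types, 64-bit little-endian
assumed; this is not the real struct lockref):

	#include <stdint.h>
	#include <stdio.h>

	/* Toy model: a 32-bit lock word and a 32-bit count sharing one
	 * 64-bit word, with the count in the upper half as on 64-bit LE.
	 */
	union toy_lockref {
		uint64_t lock_count;
		struct {
			uint32_t lock;		/* low 32 bits  */
			uint32_t count;		/* high 32 bits */
		};
	};

	int main(void)
	{
		union toy_lockref ref;

		ref.lock = 0;
		ref.count = 5;

		/* "LOCKREF_ADD(ref, 1)": one 64-bit add touches only the
		 * high half, i.e. the count, and leaves the lock intact.
		 */
		ref.lock_count += (uint64_t)1 << 32;

		printf("count=%u lock=%u\n", ref.count, ref.lock);	/* count=6 lock=0 */
		return 0;
	}

On big-endian the count ends up in the low half of the 64-bit word, so
the shift trick does not apply there and the fallback should simply add
to (n).count directly, which is what the question above points out.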
On Sun, Nov 26, 2023 at 08:51:35AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
> >
> > Here is my optimization advice:
> >
> > #define CMPXCHG_LOOP(CODE, SUCCESS) do { \
> >         int retry = 100; \
> >         struct lockref old; \
> >         BUILD_BUG_ON(sizeof(old) != 8); \
> > +       prefetchw(lockref); \\
>
> No.
>
> We're not adding software prefetches to generic code. Been there, done
> that. They *never* improve performance on good hardware. They end up
> helping on some random (usually particularly bad) microarchitecture,
> and then they hurt everybody else.
>
> And the real optimization advice is: "don't run on crap hardware".
>
> It really is that simple.
Good hardware does OoO and sees the future write. That needs the
expensive mechanism DynAMO [1], but some power-efficient core lacks the
capability. Yes, powerful OoO hardware could virtually satisfy you by a
minimum number of retries, but why couldn't we explicitly tell hardware
for "prefetchw"?

Advanced hardware would treat cmpxchg as interconnect transactions when
cache miss(far atomic), which means L3 cache wouldn't return a unique
cacheline even when cmpxchg fails. The cmpxchg loop would continue to
read data bypassing the L1/L2 cache, which means every failure cmpxchg
is a cache-miss read.

Because of the "new.count++"/CODE data dependency, each cmpxchg request
must wait for the previous one to finish. That leaves a gap between the
cmpxchg requests, and during serious contention most CPUs' cmpxchgs keep
failing.

              cas: Compare-And-Swap

            L1&L2          L3 cache
 +------+         +-----------
 | CPU1 |  wait   |
 | cas2 |-------->| CPU1_cas1 --+
 +------+         |             |
 +------+         |             |
 | CPU2 |  wait   |             |
 | cas2 |-------->| CPU2_cas1 --+--> If CPU1_cas1, CPU2_cas1 and
 +------+         |             |    CPU3_cas1 are queued, most of the
 +------+         |             |    CPUs' cas2 would fail and retry.
 | CPU3 |  wait   |             |
 | cas2 |-------->| CPU3_cas1---+
 +------+         +----------

The entire system makes forward progress, but inefficiently:
 - a large number of useless CPU->L3 read requests
 - high power consumption
 - poor performance

Still, "far atomic" is a good fit when contention is not particularly
serious, so it is reasonable to let the software give the hardware a
hint. That hint is "prefetchw":
 - prefetchw is the preparation for a "load + cmpxchg loop"
 - prefetchw is not meant for a single AMO or CAS or store

[1] https://dl.acm.org/doi/10.1145/3579371.3589065

>
> > Micro-arch could give prefetchw more guarantee:
>
> Well, in practice, they never do, and in fact they are often buggy and
> cause problems because they weren't actually tested very much.
>
> Linus
>
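As a plain-C model of the loop being described (user-space C11 atomics
with a made-up function name; the kernel code is the CMPXCHG_LOOP macro
quoted above), note that every failed compare-exchange hands back the
current value of the contended word, i.e. each failure costs another
access to a line the CPU does not own:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Bounded-retry CAS increment of the count held in the high 32 bits,
	 * giving up after too many failures so the caller can take the lock.
	 */
	static bool bounded_cas_inc(_Atomic uint64_t *word)
	{
		uint64_t old = atomic_load_explicit(word, memory_order_relaxed);

		for (int retry = 100; retry > 0; retry--) {
			uint64_t new = old + ((uint64_t)1 << 32);

			/* On failure 'old' is refreshed with the current value;
			 * under heavy contention that refresh is typically yet
			 * another cache-missing access to the line.
			 */
			if (atomic_compare_exchange_weak_explicit(word, &old, new,
								  memory_order_relaxed,
								  memory_order_relaxed))
				return true;
		}
		return false;		/* caller falls back to the spinlock */
	}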
On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@kernel.org> wrote: > > That needs the expensive mechanism DynAMO [1], but some power-efficient > core lacks the capability. Yes, powerful OoO hardware could virtually > satisfy you by a minimum number of retries, but why couldn't we > explicitly tell hardware for "prefetchw"? Because every single time we've had a prefetch in the kernel, it has caused problems. A bit like cpu_relax() - these things get added for random hardware where it helps, and then a few years later it turns out that it hurts almost everywhere else. We've had particular problems with 'prefetch' because it turns out that (a) nobody sane uses them so (b) hardware is often buggy. And here "buggy" may be just performance (ie "prefetch actually stalls on TLB lookup" etc broken behavior that means that prefetch is not even remotely like a no-op that just hints to the cache subsystem), but sometimes even in actual semantics (ie "prefetch causes spurious faulting behavior") > Advanced hardware would treat cmpxchg as interconnect transactions when > cache miss(far atomic), which means L3 cache wouldn't return a unique > cacheline even when cmpxchg fails. The cmpxchg loop would continue to > read data bypassing the L1/L2 cache, which means every failure cmpxchg > is a cache-miss read. Honestly, I wouldn't call that "advanced hardware". I would call that ridiculous. If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is broken. It's really that simple. It does sound like on your hardware, maybe you just want to make the RISC-V cmpxchg function always do a "prefetchw" if the 'sc.d' fails, something like "0: lr.w %0, %2\n" \ " bne %0, %z3, 1f\n" \ " sc.w %1, %z4, %2\n" \ - " bnez %1, 0b\n" \ + " beqz %1, 1f\n" \ + " prefetchw %2\n" \ + " j 0b\n" \ "1:\n" \ (quick entirely untested hack, you get the idea). A better implementation might use "asm goto" and expose the different error cases to the compiler so that it can move things around, but I'm not convinced it's worth the effort. But no, we're *not* adding a prefetchw to generic code just because apparently some RISC-V code is doing bad things. You need to keep workarounds for RISC-V behavior to RISC-V. And yes, the current "retry count" in our lockref implementation comes from another "some hardware does bad things for cmpxchg". But that workaround at most causes a few extra (regular) ALU instructions, and while not optimal, it's at least not going to cause any bigger problems. Linus
On Fri, Dec 01, 2023 at 10:09:01AM +0900, Linus Torvalds wrote: > On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@kernel.org> wrote: > > > > That needs the expensive mechanism DynAMO [1], but some power-efficient > > core lacks the capability. Yes, powerful OoO hardware could virtually > > satisfy you by a minimum number of retries, but why couldn't we > > explicitly tell hardware for "prefetchw"? > > Because every single time we've had a prefetch in the kernel, it has > caused problems. A bit like cpu_relax() - these things get added for > random hardware where it helps, and then a few years later it turns > out that it hurts almost everywhere else. > > We've had particular problems with 'prefetch' because it turns out > that (a) nobody sane uses them so (b) hardware is often buggy. And > here "buggy" may be just performance (ie "prefetch actually stalls on > TLB lookup" etc broken behavior that means that prefetch is not even > remotely like a no-op that just hints to the cache subsystem), but > sometimes even in actual semantics (ie "prefetch causes spurious > faulting behavior") Thanks for sharing your experience, I now know the problem with generic prefetchw. But what to do with these codes? ➜ linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r mm/slub.c: prefetchw(object + s->offset); mm/slab.c: prefetchw(objp); mm/page_alloc.c: prefetchw(p); mm/page_alloc.c: prefetchw(p + 1); mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) mm/vmscan.c: prefetchw(&prev->_field); mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do mm/vmscan.c: prefetchw_prev_lru_folio(folio, src, flags); fs/mpage.c: prefetchw(&folio->flags); fs/f2fs/data.c: prefetchw(&page->flags); fs/ext4/readpage.c: prefetchw(&folio->flags); kernel/bpf/cpumap.c: prefetchw(page); kernel/locking/qspinlock.c: prefetchw(next); ➜ linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l 80 ... > > > Advanced hardware would treat cmpxchg as interconnect transactions when > > cache miss(far atomic), which means L3 cache wouldn't return a unique > > cacheline even when cmpxchg fails. The cmpxchg loop would continue to > > read data bypassing the L1/L2 cache, which means every failure cmpxchg > > is a cache-miss read. > > Honestly, I wouldn't call that "advanced hardware". I would call that > ridiculous. Ridiculous Hardware: When CAS fails, the hardware still keeps "far atomic" mode. Correct Hardware: When CAS fails, the hardware should change to "near-atomic," which means acquiring an exclusive cache line and making progress. > > If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is > broken. It's really that simple. I totally agree, and it's a correct guide, Thx. 
> > It does sound like on your hardware, maybe you just want to make the > RISC-V cmpxchg function always do a "prefetchw" if the 'sc.d' fails, > something like > > "0: lr.w %0, %2\n" \ > " bne %0, %z3, 1f\n" \ > " sc.w %1, %z4, %2\n" \ > - " bnez %1, 0b\n" \ > + " beqz %1, 1f\n" \ > + " prefetchw %2\n" \ > + " j 0b\n" \ > "1:\n" \ I modify your code to guarantee the progress of the comparison failure situation: Final version (for easy read): "0: lr.w %0, %2\n" \ " bne %0, %z3, 2f\n" \ " sc.w %1, %z4, %2\n" \ " beqz %1, 1f\n" \ " prefetchw %2\n" \ " j 0b\n" \ "2:\n" \ " prefetchw %2\n" \ "1:\n" \ Diff version: "0: lr.w %0, %2\n" \ - " bne %0, %z3, 1f\n" \ + " bne %0, %z3, 2f\n" \ " sc.w %1, %z4, %2\n" \ - " bnez %1, 0b\n" \ + " beqz %1, 1f\n" \ + " prefetchw %2\n" \ + " j 0b\n" \ + "2:\n" \ + " prefetchw %2\n" \ "1:\n" \ > > (quick entirely untested hack, you get the idea). A better > implementation might use "asm goto" and expose the different error > cases to the compiler so that it can move things around, but I'm not > convinced it's worth the effort. > > But no, we're *not* adding a prefetchw to generic code just because > apparently some RISC-V code is doing bad things. You need to keep > workarounds for RISC-V behavior to RISC-V. > > And yes, the current "retry count" in our lockref implementation comes > from another "some hardware does bad things for cmpxchg". But that > workaround at most causes a few extra (regular) ALU instructions, and > while not optimal, it's at least not going to cause any bigger > problems. > > Linus >
On Fri, 1 Dec 2023 at 12:36, Guo Ren <guoren@kernel.org> wrote: > > I modify your code to guarantee the progress of the comparison failure > situation: Are you sure you want to prefetch when the value doesn't even match the existing value? Aren't you better off just looping doing just reads until you at least have a valid value to exchange? Otherwise you might easily find that your cmpxchg loops cause horrendous cacheline ping-pong patterns. Of course, if your hardware is bad at releasing the written state, that may actually be what you want, to see changes in a timely manner. At least some of our cmpxchg uses are the "try_cmpxchg()" pattern, which wouldn't even loop - and won't write at all - on a value mismatch. And some of those try_cmpxchg cases are a lot more important than the lockref code. Things like spin_trylock() etc. Of course, for best results you might want to have an actual architecture-specific helper for the try_cmpxchg case, and use the compiler for "outputs in condition codes" (but then you need to have fallback cases for older compilers that don't support it). See the code code for example of the kinds of nasty support code you need with /* * Macros to generate condition code outputs from inline assembly, * The output operand must be type "bool". */ #ifdef __GCC_ASM_FLAG_OUTPUTS__ # define CC_SET(c) "\n\t/* output condition code " #c "*/\n" # define CC_OUT(c) "=@cc" #c #else # define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n" # define CC_OUT(c) [_cc_ ## c] "=qm" #endif and then a lot of "CC_SET()/CC_OUT()" use in the inline asms in <asm/cmpxchg.h>... IOW, you really should time this and then add the timing information to whatever commit message. Linus
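For reference, the try_cmpxchg() pattern mentioned above looks roughly
like the following sketch (made-up helper name; atomic_try_cmpxchg()
updates 'old' on failure, and nothing is ever stored when the early test
bails out):

	/* "increment unless zero" in the try_cmpxchg() style: read once,
	 * refuse to write at all if the value rules it out, and otherwise
	 * let a failed try_cmpxchg() hand us the fresh value to re-check.
	 */
	static inline bool my_inc_unless_zero(atomic_t *v)
	{
		int old = atomic_read(v);

		do {
			if (old == 0)
				return false;	/* no store, no retry loop */
		} while (!atomic_try_cmpxchg(v, &old, old + 1));

		return true;
	}

A trylock is the degenerate form of the same pattern: a single
try_cmpxchg() with no loop at all.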
On Fri, Dec 01, 2023 at 02:15:15PM +0900, Linus Torvalds wrote: > On Fri, 1 Dec 2023 at 12:36, Guo Ren <guoren@kernel.org> wrote: > > > > I modify your code to guarantee the progress of the comparison failure > > situation: > > Are you sure you want to prefetch when the value doesn't even match > the existing value? Aren't you better off just looping doing just > reads until you at least have a valid value to exchange? Oops, you are right, I'm wrong here. Here is what I want to say: + " prefetch %2\n" \ "0: lr.w %0, %2\n" \ " bne %0, %z3, 1f\n" \ " sc.w %1, %z4, %2\n" \ " beqz %1, 1f\n" \ " prefetchw %2\n" \ " j 0b\n" \ "1:\n" \ Just add a prefetch shared cache line before, which could stick the shared cache line several cycles to ensure the outer cmpxchg loop could make progress. All we wrote here is not for actual code, and it's just a reference for hardware guys. - lr could imply a sticky shared cache line. - sc could imply a sticky unique cache line when sc fails. > > Otherwise you might easily find that your cmpxchg loops cause > horrendous cacheline ping-pong patterns. > > Of course, if your hardware is bad at releasing the written state, > that may actually be what you want, to see changes in a timely manner. > > At least some of our cmpxchg uses are the "try_cmpxchg()" pattern, > which wouldn't even loop - and won't write at all - on a value > mismatch. > > And some of those try_cmpxchg cases are a lot more important than the > lockref code. Things like spin_trylock() etc. Of course, for best > results you might want to have an actual architecture-specific helper > for the try_cmpxchg case, and use the compiler for "outputs in > condition codes" (but then you need to have fallback cases for older > compilers that don't support it). > > See the code code for example of the kinds of nasty support code you need with > > /* > * Macros to generate condition code outputs from inline assembly, > * The output operand must be type "bool". > */ > #ifdef __GCC_ASM_FLAG_OUTPUTS__ > # define CC_SET(c) "\n\t/* output condition code " #c "*/\n" > # define CC_OUT(c) "=@cc" #c > #else > # define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n" > # define CC_OUT(c) [_cc_ ## c] "=qm" > #endif > > and then a lot of "CC_SET()/CC_OUT()" use in the inline asms in > <asm/cmpxchg.h>... Thanks for the tip. It's helpful to try_cmpxchg optimization. > > IOW, you really should time this and then add the timing information > to whatever commit message. > > Linus >
diff --git a/fs/dcache.c b/fs/dcache.c index bd57b9a08894..9e1486db64a7 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -665,30 +665,57 @@ static bool lock_for_kill(struct dentry *dentry) return false; } -static inline bool retain_dentry(struct dentry *dentry) +/* + * Decide if dentry is worth retaining. Usually this is called with dentry + * locked; if not locked, we are more limited and might not be able to tell + * without a lock. False in this case means "punt to locked path and recheck". + * + * In case we aren't locked, these predicates are not "stable". However, it is + * sufficient that at some point after we dropped the reference the dentry was + * hashed and the flags had the proper value. Other dentry users may have + * re-gotten a reference to the dentry and change that, but our work is done - + * we can leave the dentry around with a zero refcount. + */ +static inline bool retain_dentry(struct dentry *dentry, bool locked) { - WARN_ON(d_in_lookup(dentry)); + unsigned int d_flags; - /* Unreachable? Get rid of it */ + smp_rmb(); + d_flags = READ_ONCE(dentry->d_flags); + + // Unreachable? Nobody would be able to look it up, no point retaining if (unlikely(d_unhashed(dentry))) return false; - if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED)) + // Same if it's disconnected + if (unlikely(d_flags & DCACHE_DISCONNECTED)) return false; - if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) { - if (dentry->d_op->d_delete(dentry)) + // ->d_delete() might tell us not to bother, but that requires + // ->d_lock; can't decide without it + if (unlikely(d_flags & DCACHE_OP_DELETE)) { + if (!locked || dentry->d_op->d_delete(dentry)) return false; } - if (unlikely(dentry->d_flags & DCACHE_DONTCACHE)) + // Explicitly told not to bother + if (unlikely(d_flags & DCACHE_DONTCACHE)) return false; - /* retain; LRU fodder */ - if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST))) + // At this point it looks like we ought to keep it. We also might + // need to do something - put it on LRU if it wasn't there already + // and mark it referenced if it was on LRU, but not marked yet. + // Unfortunately, both actions require ->d_lock, so in lockless + // case we'd have to punt rather than doing those. + if (unlikely(!(d_flags & DCACHE_LRU_LIST))) { + if (!locked) + return false; d_lru_add(dentry); - else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED))) + } else if (unlikely(!(d_flags & DCACHE_REFERENCED))) { + if (!locked) + return false; dentry->d_flags |= DCACHE_REFERENCED; + } return true; } @@ -720,7 +747,6 @@ EXPORT_SYMBOL(d_mark_dontcache); static inline bool fast_dput(struct dentry *dentry) { int ret; - unsigned int d_flags; /* * try to decrement the lockref optimistically. @@ -749,45 +775,18 @@ static inline bool fast_dput(struct dentry *dentry) return true; /* - * Careful, careful. The reference count went down - * to zero, but we don't hold the dentry lock, so - * somebody else could get it again, and do another - * dput(), and we need to not race with that. - * - * However, there is a very special and common case - * where we don't care, because there is nothing to - * do: the dentry is still hashed, it does not have - * a 'delete' op, and it's referenced and already on - * the LRU list. - * - * NOTE! Since we aren't locked, these values are - * not "stable". However, it is sufficient that at - * some point after we dropped the reference the - * dentry was hashed and the flags had the proper - * value. 
Other dentry users may have re-gotten - * a reference to the dentry and change that, but - * our work is done - we can leave the dentry - * around with a zero refcount. - * - * Nevertheless, there are two cases that we should kill - * the dentry anyway. - * 1. free disconnected dentries as soon as their refcount - * reached zero. - * 2. free dentries if they should not be cached. + * Can we decide that decrement of refcount is all we needed without + * taking the lock? There's a very common case when it's all we need - + * dentry looks like it ought to be retained and there's nothing else + * to do. */ - smp_rmb(); - d_flags = READ_ONCE(dentry->d_flags); - d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_OP_DELETE | - DCACHE_DISCONNECTED | DCACHE_DONTCACHE; - - /* Nothing to do? Dropping the reference was all we needed? */ - if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry)) + if (retain_dentry(dentry, false)) return true; /* - * Not the fast normal case? Get the lock. We've already decremented - * the refcount, but we'll need to re-check the situation after - * getting the lock. + * Either not worth retaining or we can't tell without the lock. + * Get the lock, then. We've already decremented the refcount to 0, + * but we'll need to re-check the situation after getting the lock. */ spin_lock(&dentry->d_lock); @@ -798,7 +797,7 @@ static inline bool fast_dput(struct dentry *dentry) * don't need to do anything else. */ locked: - if (dentry->d_lockref.count || retain_dentry(dentry)) { + if (dentry->d_lockref.count || retain_dentry(dentry, true)) { spin_unlock(&dentry->d_lock); return true; } @@ -847,7 +846,7 @@ void dput(struct dentry *dentry) dentry = __dentry_kill(dentry); if (!dentry) return; - if (retain_dentry(dentry)) { + if (retain_dentry(dentry, true)) { spin_unlock(&dentry->d_lock); return; }
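Condensed to its essentials, the two modes of the new helper are meant to
combine as in the following simplified model of the fast_dput() flow
above (hypothetical wrapper name; the real function also handles the
failed-lockref case and the refcount bookkeeping elided here):

	static bool can_leave_dentry(struct dentry *dentry)	/* hypothetical */
	{
		/* Lockless attempt: may only answer "retain" or "don't know". */
		if (retain_dentry(dentry, false))
			return true;

		/* Couldn't tell without ->d_lock, so take it and re-check.
		 * Only the locked call may run ->d_delete() or touch the LRU,
		 * and only it is allowed to answer a definitive "kill".
		 */
		spin_lock(&dentry->d_lock);
		if (dentry->d_lockref.count || retain_dentry(dentry, true)) {
			spin_unlock(&dentry->d_lock);
			return true;
		}
		return false;		/* caller kills it with ->d_lock held */
	}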