Message ID | 20180409145409.GA9661@arm.com (mailing list archive)
---|---
State | New, archived
On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 19261af9f61e..71eb5e3a3d91 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -139,6 +139,20 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>  	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>  }
>  
> +/**
> + * set_pending_fetch_acquire - set the pending bit and return the old lock
> + * value with acquire semantics.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> +	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
> +	return val;
> +}
> @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		return;
>  
>  	/*
> -	 * If we observe any contention; queue.
> +	 * If we observe queueing, then queue ourselves.
>  	 */
> -	if (val & ~_Q_LOCKED_MASK)
> +	if (val & _Q_TAIL_MASK)
>  		goto queue;
>  
>  	/*
> +	 * We didn't see any queueing, so have one more try at snatching
> +	 * the lock in case it became available whilst we were taking the
> +	 * slow path.
> +	 */
> +	if (queued_spin_trylock(lock))
> +		return;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> +	val = set_pending_fetch_acquire(lock);
>  	if (!(val & ~_Q_LOCKED_MASK)) {

So, if I remember that partial paper correctly, the atomic_read_acquire()
can see 'arbitrary' old values for everything except the pending byte,
which it just wrote and will fwd into our load, right?

But I think coherence requires the read to not be older than the one
observed by the trylock before (since it uses c-cas its acquire can be
elided).

I think this means we can miss a concurrent unlock vs the fetch_or. And
I think that's fine; if we still see the lock set we'll needlessly 'wait'
for it to become unlocked.
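To make the comparison above concrete, here is a small userspace model of the
two formulations: the old whole-word fetch_or versus the split byte-sized xchg
plus a separate word-sized acquire load. It is not kernel code; the GCC/Clang
__atomic builtins, the little-endian layout and the _Q_PENDING_BITS == 8 field
offsets are assumptions of the sketch, not taken from the kernel headers.

/* Userspace sketch only: models the two ways of setting the pending byte. */
#include <stdint.h>
#include <stdio.h>

struct qsl_model {
	union {
		uint32_t val;			/* whole lock word            */
		struct {
			uint8_t  locked;	/* bits  0-7  (assumed LE)    */
			uint8_t  pending;	/* bits  8-15                 */
			uint16_t tail;		/* bits 16-31                 */
		};
	};
};

#define PENDING_OFFSET	8
#define PENDING_MASK	(0xffu << PENDING_OFFSET)

/* Pre-patch formulation: one atomic RMW covering the whole lock word. */
static uint32_t pending_fetch_or(struct qsl_model *lock)
{
	return __atomic_fetch_or(&lock->val, 1u << PENDING_OFFSET,
				 __ATOMIC_ACQUIRE);
}

/*
 * Patched formulation: relaxed xchg on the pending byte, then a separate
 * acquire load of the whole word.  The load is not part of the RMW, which
 * is what the question above is about: how old a value may it return for
 * the rest of the word?
 */
static uint32_t pending_split(struct qsl_model *lock)
{
	uint32_t val;

	val  = (uint32_t)__atomic_exchange_n(&lock->pending, 1,
					     __ATOMIC_RELAXED) << PENDING_OFFSET;
	val |= __atomic_load_n(&lock->val, __ATOMIC_ACQUIRE) & ~PENDING_MASK;
	return val;
}

int main(void)
{
	struct qsl_model a = { .val = 1 }, b = { .val = 1 };

	/* Uncontended, both report the same old value: locked, no pending. */
	printf("fetch_or: 0x%x  split: 0x%x\n",
	       (unsigned)pending_fetch_or(&a), (unsigned)pending_split(&b));
	return 0;
}

Uncontended the two behave identically; the discussion here is about what the
separate word load may return while the lock word is changing under it.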
On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		return;
> >  
> >  	/*
> > -	 * If we observe any contention; queue.
> > +	 * If we observe queueing, then queue ourselves.
> >  	 */
> > -	if (val & ~_Q_LOCKED_MASK)
> > +	if (val & _Q_TAIL_MASK)
> >  		goto queue;
> >  
> >  	/*
> > +	 * We didn't see any queueing, so have one more try at snatching
> > +	 * the lock in case it became available whilst we were taking the
> > +	 * slow path.
> > +	 */
> > +	if (queued_spin_trylock(lock))
> > +		return;
> > +
> > +	/*
> >  	 * trylock || pending
> >  	 *
> >  	 * 0,0,0 -> 0,0,1 ; trylock
> >  	 * 0,0,1 -> 0,1,1 ; pending
> >  	 */
> > +	val = set_pending_fetch_acquire(lock);
> >  	if (!(val & ~_Q_LOCKED_MASK)) {
> 
> So, if I remember that partial paper correctly, the atomic_read_acquire()
> can see 'arbitrary' old values for everything except the pending byte,
> which it just wrote and will fwd into our load, right?
> 
> But I think coherence requires the read to not be older than the one
> observed by the trylock before (since it uses c-cas its acquire can be
> elided).
> 
> I think this means we can miss a concurrent unlock vs the fetch_or. And
> I think that's fine; if we still see the lock set we'll needlessly 'wait'
> for it to become unlocked.

Ah, but there is a related case that doesn't work. If the lock becomes
free just before we set pending, then another CPU can succeed on the
fastpath. We'll then set pending, but the lockword we get back may still
have the locked byte of 0, so two people end up holding the lock.

I think it's worth giving this a go with the added trylock, but I can't
see a way to avoid the atomic_fetch_or at the moment.

Will
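The failure Will describes can be replayed step by step. The sketch below is
plain single-threaded C that constructs the stale lock-word value by hand,
because the point is precisely that a weakly ordered machine may return such a
value for the load in set_pending_fetch_acquire; all constants and names here
are illustrative, not the kernel's.

/* Deterministic replay of the two-holders interleaving; userspace sketch. */
#include <stdint.h>
#include <stdio.h>

#define LOCKED_VAL	0x001u		/* locked byte            */
#define LOCKED_MASK	0x0ffu
#define PENDING_VAL	0x100u		/* pending byte           */

int main(void)
{
	uint32_t val = LOCKED_VAL;	/* CPU2 currently holds the lock      */
	int cpu0_holds = 0, cpu1_holds = 0;

	/* CPU0: enters the slowpath; sees no tail, and its trylock fails
	 * because the lock is still held. */

	val &= ~LOCKED_MASK;		/* CPU2: releases the lock            */
	uint32_t stale = val;		/* lock word at this instant: all 0   */

	val |= LOCKED_VAL;		/* CPU1: fastpath trylock succeeds    */
	cpu1_holds = 1;

	val |= PENDING_VAL;		/* CPU0: xchg_relaxed(&lock->pending, 1),
					 * the old pending byte was 0         */

	/*
	 * CPU0: atomic_read_acquire(&lock->val).  The claim above is that the
	 * value returned may predate CPU1's fastpath store, i.e. look like
	 * 'stale' rather than 'val'.
	 */
	uint32_t ret = (0u << 8) | (stale & ~0xff00u);

	if (!(ret & ~LOCKED_MASK) && !(ret & LOCKED_MASK)) {
		/* Looks like 0,0,0: no wait, CPU0 does
		 * clear_pending_set_locked() and takes the lock too. */
		cpu0_holds = 1;
	}

	printf("lock word 0x%x, cpu0 holds %d, cpu1 holds %d (both set => mutual exclusion broken)\n",
	       (unsigned)val, cpu0_holds, cpu1_holds);
	return 0;
}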
On 04/09/2018 10:54 AM, Will Deacon wrote:
>
>>> I am not against this patch, but we certainly need to find out a way to
>>> bring the performance number up closer to what it is before applying
>>> the patch.
>> We certainly need to *understand* where the drop is coming from, because
>> the two-threaded case is still just a CAS on x86 with and without this
>> patch series. Generally, there's a throughput cost when ensuring fairness
>> and forward-progress; otherwise we'd all be using test-and-set.
> Whilst I think we still need to address my questions above, I've had a
> crack at the diff below. Please can you give it a spin? It sticks a trylock
> on the slowpath before setting pending and replaces the CAS-based set
> with an xchg (which I *think* is safe, but will need to ponder it some
> more).
>
> Thanks,
>
> Will
>
Unfortunately, this patch didn't help.

pending count = 777
queuing count = 9,991,272
locking rate = 4,087 kop/s

-Longman
On Mon, Apr 09, 2018 at 06:19:59PM +0100, Will Deacon wrote:
> On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		return;
> > >  
> > >  	/*
> > > -	 * If we observe any contention; queue.
> > > +	 * If we observe queueing, then queue ourselves.
> > >  	 */
> > > -	if (val & ~_Q_LOCKED_MASK)
> > > +	if (val & _Q_TAIL_MASK)
> > >  		goto queue;
> > >  
> > >  	/*
> > > +	 * We didn't see any queueing, so have one more try at snatching
> > > +	 * the lock in case it became available whilst we were taking the
> > > +	 * slow path.
> > > +	 */
> > > +	if (queued_spin_trylock(lock))
> > > +		return;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = set_pending_fetch_acquire(lock);
> > >  	if (!(val & ~_Q_LOCKED_MASK)) {
> > 
> > So, if I remember that partial paper correctly, the atomic_read_acquire()
> > can see 'arbitrary' old values for everything except the pending byte,
> > which it just wrote and will fwd into our load, right?
> > 
> > But I think coherence requires the read to not be older than the one
> > observed by the trylock before (since it uses c-cas its acquire can be
> > elided).
> > 
> > I think this means we can miss a concurrent unlock vs the fetch_or. And
> > I think that's fine; if we still see the lock set we'll needlessly 'wait'
> > for it to become unlocked.
> 
> Ah, but there is a related case that doesn't work. If the lock becomes
> free just before we set pending, then another CPU can succeed on the
> fastpath. We'll then set pending, but the lockword we get back may still
> have the locked byte of 0, so two people end up holding the lock.
> 
> I think it's worth giving this a go with the added trylock, but I can't
> see a way to avoid the atomic_fetch_or at the moment.

Oh yikes, indeed. Yeah, I don't see how we'd be able to fix that one.
On Mon, Apr 09, 2018 at 06:19:59PM +0100, Will Deacon wrote:
> On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > > +/**
> > > + * set_pending_fetch_acquire - set the pending bit and return the old lock
> > > + * value with acquire semantics.
> > > + * @lock: Pointer to queued spinlock structure
> > > + *
> > > + * *,*,* -> *,1,*
> > > + */
> > > +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> > > +{
> > > +	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

	smp_mb();

> > > +	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
> > > +	return val;
> > > +}
> > > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		return;
> > >  
> > >  	/*
> > > -	 * If we observe any contention; queue.
> > > +	 * If we observe queueing, then queue ourselves.
> > >  	 */
> > > -	if (val & ~_Q_LOCKED_MASK)
> > > +	if (val & _Q_TAIL_MASK)
> > >  		goto queue;
> > >  
> > >  	/*
> > > +	 * We didn't see any queueing, so have one more try at snatching
> > > +	 * the lock in case it became available whilst we were taking the
> > > +	 * slow path.
> > > +	 */
> > > +	if (queued_spin_trylock(lock))
> > > +		return;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = set_pending_fetch_acquire(lock);
> > >  	if (!(val & ~_Q_LOCKED_MASK)) {
> > 
> > So, if I remember that partial paper correctly, the atomic_read_acquire()
> > can see 'arbitrary' old values for everything except the pending byte,
> > which it just wrote and will fwd into our load, right?
> > 
> > But I think coherence requires the read to not be older than the one
> > observed by the trylock before (since it uses c-cas its acquire can be
> > elided).
> > 
> > I think this means we can miss a concurrent unlock vs the fetch_or. And
> > I think that's fine; if we still see the lock set we'll needlessly 'wait'
> > for it to become unlocked.
> 
> Ah, but there is a related case that doesn't work. If the lock becomes
> free just before we set pending, then another CPU can succeed on the
> fastpath. We'll then set pending, but the lockword we get back may still
> have the locked byte of 0, so two people end up holding the lock.
> 
> I think it's worth giving this a go with the added trylock, but I can't
> see a way to avoid the atomic_fetch_or at the moment.

So IIRC the addition of the smp_mb() above should ensure the @val load is
later than the @pending store. Which makes the thing work again, right?

Now, obviously you don't actually want that on ARM64, but I can do that
on x86 just fine (our xchg() implies smp_mb() after all).

Another approach might be to use something like:

	val = xchg_relaxed(&lock->locked_pending, _Q_PENDING_VAL | _Q_LOCKED_VAL);
	val |= atomic_read_acquire(&lock->val) & _Q_TAIL_MASK;

combined with something like:

	/* 0,0,0 -> 0,1,1 - we won trylock */
	if (!(val & _Q_LOCKED_MASK)) {
		clear_pending(lock);
		return;
	}

	/* 0,0,1 -> 0,1,1 - we won pending */
	if (!(val & ~_Q_LOCKED_MASK)) {
		...
	}

	/* *,0,1 -> *,1,1 - we won pending, but there's queueing */
	if (!(val & _Q_PENDING_VAL))
		clear_pending(lock);

	...

Hmmm?
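For reference, the variant Peter is proposing above, written out as a single
function. This is a sketch of his suggestion rather than a tested patch: the
barrier orders the pending-byte store before the whole-word load, which is the
property the split version was missing; on x86 the xchg() already implies a
full barrier, whereas on arm64 the smp_mb() is the extra cost under discussion.

/*
 * Sketch only: set_pending_fetch_acquire() with the full barrier Peter
 * suggests between the @pending store and the @val load.
 */
static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
{
	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

	smp_mb();	/* order the pending store before the val load */

	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
	return val;
}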
On Thu, Sep 20, 2018 at 06:08:32PM +0200, Peter Zijlstra wrote:
> Another approach might be to use something like:
> 
> 	val = xchg_relaxed(&lock->locked_pending, _Q_PENDING_VAL | _Q_LOCKED_VAL);
> 	val |= atomic_read_acquire(&lock->val) & _Q_TAIL_MASK;
> 
> combined with something like:
> 
> 	/* 0,0,0 -> 0,1,1 - we won trylock */
> 	if (!(val & _Q_LOCKED_MASK)) {

That one doesn't actually work... let me think about this more.

> 		clear_pending(lock);
> 		return;
> 	}
> 
> 	/* 0,0,1 -> 0,1,1 - we won pending */
> 	if (!(val & ~_Q_LOCKED_MASK)) {
> 		...
> 	}
> 
> 	/* *,0,1 -> *,1,1 - we won pending, but there's queueing */
> 	if (!(val & _Q_PENDING_VAL))
> 		clear_pending(lock);
> 
> 	...
> 
> 
> Hmmm?
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 19261af9f61e..71eb5e3a3d91 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -139,6 +139,20 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
+/**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ * value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
+	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
+	return val;
+}
+
 /*
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
@@ -184,6 +198,18 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 }
 
 /**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ * value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
  * @tail : The new queue tail code word
@@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * If we observe any contention; queue.
+	 * If we observe queueing, then queue ourselves.
 	 */
-	if (val & ~_Q_LOCKED_MASK)
+	if (val & _Q_TAIL_MASK)
 		goto queue;
 
 	/*
+	 * We didn't see any queueing, so have one more try at snatching
+	 * the lock in case it became available whilst we were taking the
+	 * slow path.
+	 */
+	if (queued_spin_trylock(lock))
+		return;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = set_pending_fetch_acquire(lock);
 	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
 		 * we're pending, wait for the owner to go away.
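The comments in the patch use the (queue tail, pending, locked) triple notation
from qspinlock.c, e.g. 0,0,1 -> 0,1,1 for setting pending while the lock is
held. The small userspace decoder below spells that out; the field offsets
mirror the _Q_PENDING_BITS == 8 layout (NR_CPUS < 16K) and are an assumption of
this sketch, not a quote of the kernel headers.

/* Reader aid, not kernel code: decode a lock word into (tail, pending, locked). */
#include <stdint.h>
#include <stdio.h>

#define Q_LOCKED_OFFSET		0
#define Q_LOCKED_MASK		(0xffu << Q_LOCKED_OFFSET)
#define Q_PENDING_OFFSET	8
#define Q_PENDING_MASK		(0xffu << Q_PENDING_OFFSET)
#define Q_TAIL_OFFSET		16
#define Q_TAIL_MASK		(0xffffu << Q_TAIL_OFFSET)

#define Q_LOCKED_VAL		(1u << Q_LOCKED_OFFSET)
#define Q_PENDING_VAL		(1u << Q_PENDING_OFFSET)

static void show(uint32_t val, const char *what)
{
	printf("%-24s tail=0x%04x pending=%u locked=%u\n", what,
	       (unsigned)((val & Q_TAIL_MASK) >> Q_TAIL_OFFSET),
	       (unsigned)!!(val & Q_PENDING_MASK),
	       (unsigned)(val & Q_LOCKED_MASK));
}

int main(void)
{
	show(0, "unlocked (0,0,0)");
	show(Q_LOCKED_VAL, "trylock won (0,0,1)");
	show(Q_LOCKED_VAL | Q_PENDING_VAL, "pending set (0,1,1)");
	show(Q_LOCKED_VAL | (1u << Q_TAIL_OFFSET), "queueing (n,0,1)");
	return 0;
}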