
[02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath

Message ID 20180409145409.GA9661@arm.com (mailing list archive)
State New, archived

Commit Message

Will Deacon April 9, 2018, 2:54 p.m. UTC
On Mon, Apr 09, 2018 at 11:58:35AM +0100, Will Deacon wrote:
> On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
> > The pending bit was added to the qspinlock design to counter performance
> > degradation compared with ticket lock for workloads with light
> > spinlock contention. I ran my spinlock stress test on an Intel Skylake
> > server running the vanilla 4.16 kernel vs a patched kernel with this
> > patchset. The locking rates with different number of locking threads
> > were as follows:
> > 
> >   # of threads  4.16 kernel     patched 4.16 kernel
> >   ------------  -----------     -------------------
> >         1       7,417 kop/s         7,408 kop/s
> >         2       5,755 kop/s         4,486 kop/s
> >         3       4,214 kop/s         4,169 kop/s
> >         4       4,396 kop/s         4,383 kop/s
> >        
> > The 2 contending threads case is the one that exercises the pending bit
> > code path the most, so it is obviously the one most impacted by this
> > patchset. The differences in the other cases are mostly noise, with
> > perhaps a small impact in the 3 contending threads case.
> 
> That is bizarre. A few questions:
> 
>   1. Is this with my patches as posted, or also with your WRITE_ONCE change?
>   2. Could you try to bisect my series to see which patch is responsible
>      for this degradation, please?
>   3. Could you point me at your stress test, so I can try to reproduce these
>      numbers on arm64 systems, please?
> 
> > I am not against this patch, but we certainly need to find out a way to
> > bring the performance number up closer to what it is before applying
> > the patch.
> 
> We certainly need to *understand* where the drop is coming from, because
> the two-threaded case is still just a CAS on x86 with and without this
> patch series. Generally, there's a throughput cost when ensuring fairness
> and forward-progress; otherwise we'd all be using test-and-set.

Whilst I think we still need to address my questions above, I've had a
crack at the diff below. Please can you give it a spin? It sticks a trylock
on the slowpath before setting pending and replaces the CAS-based set
with an xchg (which I *think* is safe, but will need to ponder it some
more).

Thanks,

Will

--->8

Comments

Peter Zijlstra April 9, 2018, 3:54 p.m. UTC | #1
On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:

> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 19261af9f61e..71eb5e3a3d91 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -139,6 +139,20 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>  	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>  }
>  
> +/**
> + * set_pending_fetch_acquire - set the pending bit and return the old lock
> + *                             value with acquire semantics.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> +	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
> +	return val;
> +}

> @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		return;
>  
>  	/*
> -	 * If we observe any contention; queue.
> +	 * If we observe queueing, then queue ourselves.
>  	 */
> -	if (val & ~_Q_LOCKED_MASK)
> +	if (val & _Q_TAIL_MASK)
>  		goto queue;
>  
>  	/*
> +	 * We didn't see any queueing, so have one more try at snatching
> +	 * the lock in case it became available whilst we were taking the
> +	 * slow path.
> +	 */
> +	if (queued_spin_trylock(lock))
> +		return;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> +	val = set_pending_fetch_acquire(lock);
>  	if (!(val & ~_Q_LOCKED_MASK)) {

So, if I remember that partial paper correctly, the atomic_read_acquire()
can see 'arbitrary' old values for everything except the pending byte,
which it just wrote and will forward into our load, right?

But I think coherence requires the read to not be older than the one
observed by the trylock before (since it uses a c-cas, its acquire can be
elided).

I think this means we can miss a concurrent unlock vs the fetch_or. And
I think that's fine: if we still see the lock set, we'll needlessly 'wait'
for it to become unlocked.
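
Roughly (a sketch of that benign case; lock word shown as (tail, pending,
locked), CPU labels purely illustrative):

	CPU A (slowpath)                      CPU B (lock owner)
	----------------                      ------------------
	xchg_relaxed(&lock->pending, 1)
	                                      unlock: (0,0,1) -> (0,0,0)
	atomic_read_acquire(&lock->val)
	  can still return locked == 1

so A spins on the locked byte until it finally observes 0 -- some wasted
cycles, but no correctness problem.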
Will Deacon April 9, 2018, 5:19 p.m. UTC | #2
On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		return;
> >  
> >  	/*
> > -	 * If we observe any contention; queue.
> > +	 * If we observe queueing, then queue ourselves.
> >  	 */
> > -	if (val & ~_Q_LOCKED_MASK)
> > +	if (val & _Q_TAIL_MASK)
> >  		goto queue;
> >  
> >  	/*
> > +	 * We didn't see any queueing, so have one more try at snatching
> > +	 * the lock in case it became available whilst we were taking the
> > +	 * slow path.
> > +	 */
> > +	if (queued_spin_trylock(lock))
> > +		return;
> > +
> > +	/*
> >  	 * trylock || pending
> >  	 *
> >  	 * 0,0,0 -> 0,0,1 ; trylock
> >  	 * 0,0,1 -> 0,1,1 ; pending
> >  	 */
> > +	val = set_pending_fetch_acquire(lock);
> >  	if (!(val & ~_Q_LOCKED_MASK)) {
> 
> So, if I remember that partial paper correctly, the atomic_read_acquire()
> can see 'arbitrary' old values for everything except the pending byte,
> which it just wrote and will forward into our load, right?
> 
> But I think coherence requires the read to not be older than the one
> observed by the trylock before (since it uses a c-cas, its acquire can be
> elided).
> 
> I think this means we can miss a concurrent unlock vs the fetch_or. And
> I think that's fine: if we still see the lock set, we'll needlessly 'wait'
> for it to become unlocked.

Ah, but there is a related case that doesn't work. If the lock becomes
free just before we set pending, then another CPU can succeed on the
fastpath. We'll then set pending, but the lockword we get back may still
have the locked byte of 0, so two people end up holding the lock.
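
Spelling that interleaving out (a sketch; lock word shown as
(tail, pending, locked), CPU labels purely illustrative):

	CPU A (slowpath)                      CPU B (fastpath)
	----------------                      ----------------
	sees (0,0,1): no tail, but the
	extra trylock fails
	                                      owner releases: (0,0,0)
	                                      trylock cmpxchg wins: (0,0,1)
	xchg_relaxed(&lock->pending, 1)
	  returns 0; word is now (0,1,1)
	atomic_read_acquire(&lock->val)
	  can still return locked == 0, as
	  nothing orders this load after
	  B's store to the locked byte
	val == 0, so A skips the pending
	wait, does clear_pending_set_locked()
	and returns as owner
	  => A and B both hold the lock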

I think it's worth giving this a go with the added trylock, but I can't
see a way to avoid the atomic_fetch_or at the moment.

Will
Waiman Long April 9, 2018, 7:33 p.m. UTC | #3
On 04/09/2018 10:54 AM, Will Deacon wrote:
>
>>> I am not against this patch, but we certainly need to find out a way to
>>> bring the performance number up closer to what it is before applying
>>> the patch.
>> We certainly need to *understand* where the drop is coming from, because
>> the two-threaded case is still just a CAS on x86 with and without this
>> patch series. Generally, there's a throughput cost when ensuring fairness
>> and forward-progress; otherwise we'd all be using test-and-set.
> Whilst I think we still need to address my questions above, I've had a
> crack at the diff below. Please can you give it a spin? It sticks a trylock
> on the slowpath before setting pending and replaces the CAS-based set
> with an xchg (which I *think* is safe, but will need to ponder it some
> more).
>
> Thanks,
>
> Will
>

Unfortunately, this patch didn't help.

pending count = 777
queuing count = 9,991,272
locking rate = 4,087 kop/s

-Longman
Peter Zijlstra April 10, 2018, 9:35 a.m. UTC | #4
On Mon, Apr 09, 2018 at 06:19:59PM +0100, Will Deacon wrote:
> On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		return;
> > >  
> > >  	/*
> > > -	 * If we observe any contention; queue.
> > > +	 * If we observe queueing, then queue ourselves.
> > >  	 */
> > > -	if (val & ~_Q_LOCKED_MASK)
> > > +	if (val & _Q_TAIL_MASK)
> > >  		goto queue;
> > >  
> > >  	/*
> > > +	 * We didn't see any queueing, so have one more try at snatching
> > > +	 * the lock in case it became available whilst we were taking the
> > > +	 * slow path.
> > > +	 */
> > > +	if (queued_spin_trylock(lock))
> > > +		return;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = set_pending_fetch_acquire(lock);
> > >  	if (!(val & ~_Q_LOCKED_MASK)) {
> > 
> > So, if I remember that partial paper correctly, the atomic_read_acquire()
> > can see 'arbitrary' old values for everything except the pending byte,
> > which it just wrote and will forward into our load, right?
> > 
> > But I think coherence requires the read to not be older than the one
> > observed by the trylock before (since it uses a c-cas, its acquire can be
> > elided).
> > 
> > I think this means we can miss a concurrent unlock vs the fetch_or. And
> > I think that's fine: if we still see the lock set, we'll needlessly 'wait'
> > for it to become unlocked.
> 
> Ah, but there is a related case that doesn't work. If the lock becomes
> free just before we set pending, then another CPU can succeed on the
> fastpath. We'll then set pending, but the lockword we get back may still
> have the locked byte of 0, so two people end up holding the lock.
> 
> I think it's worth giving this a go with the added trylock, but I can't
> see a way to avoid the atomic_fetch_or at the moment.

Oh yikes, indeed. Yeah, I don't see how we'd be able to fix that one.
Peter Zijlstra Sept. 20, 2018, 4:08 p.m. UTC | #5
On Mon, Apr 09, 2018 at 06:19:59PM +0100, Will Deacon wrote:
> On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:

> > > +/**
> > > + * set_pending_fetch_acquire - set the pending bit and return the old lock
> > > + *                             value with acquire semantics.
> > > + * @lock: Pointer to queued spinlock structure
> > > + *
> > > + * *,*,* -> *,1,*
> > > + */
> > > +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> > > +{
> > > +       u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

	smp_mb();

> > > +       val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
> > > +       return val;
> > > +}

> > > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		return;
> > >  
> > >  	/*
> > > -	 * If we observe any contention; queue.
> > > +	 * If we observe queueing, then queue ourselves.
> > >  	 */
> > > -	if (val & ~_Q_LOCKED_MASK)
> > > +	if (val & _Q_TAIL_MASK)
> > >  		goto queue;
> > >  
> > >  	/*
> > > +	 * We didn't see any queueing, so have one more try at snatching
> > > +	 * the lock in case it became available whilst we were taking the
> > > +	 * slow path.
> > > +	 */
> > > +	if (queued_spin_trylock(lock))
> > > +		return;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = set_pending_fetch_acquire(lock);
> > >  	if (!(val & ~_Q_LOCKED_MASK)) {
> > 
> > So, if I remember that partial paper correctly, the atomic_read_acquire()
> > can see 'arbitrary' old values for everything except the pending byte,
> > which it just wrote and will forward into our load, right?
> > 
> > But I think coherence requires the read to not be older than the one
> > observed by the trylock before (since it uses a c-cas, its acquire can be
> > elided).
> > 
> > I think this means we can miss a concurrent unlock vs the fetch_or. And
> > I think that's fine: if we still see the lock set, we'll needlessly 'wait'
> > for it to become unlocked.
> 
> Ah, but there is a related case that doesn't work. If the lock becomes
> free just before we set pending, then another CPU can succeed on the
> fastpath. We'll then set pending, but the lockword we get back may still
> have the locked byte of 0, so two people end up holding the lock.
> 
> I think it's worth giving this a go with the added trylock, but I can't
> see a way to avoid the atomic_fetch_or at the moment.

So IIRC the addition of the smp_mb() above should ensure the @val load
is later than the @pending store.

Which makes the thing work again, right?

Now, obviously you don't actually want that on ARM64, but I can do that
on x86 just fine (our xchg() implies smp_mb() after all).
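
That is, something like this (a sketch; the xchg()-based helper from the
patch with the barrier written out):

	static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
	{
		u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

		/*
		 * Order the @val load below after the @pending store above.
		 * On x86 the xchg() above already implies a full barrier,
		 * so this only costs extra on weakly ordered architectures.
		 */
		smp_mb();

		val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
		return val;
	}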


Another approach might be to use something like:

	val = xchg_relaxed(&lock->locked_pending, _Q_PENDING_VAL | _Q_LOCKED_VAL);
	val |= atomic_read_acquire(&lock->val) & _Q_TAIL_MASK;

combined with something like:

	/* 0,0,0 -> 0,1,1 - we won trylock */
	if (!(val & _Q_LOCKED_MASK)) {
		clear_pending(lock);
		return;
	}

	/* 0,0,1 -> 0,1,1 - we won pending */
	if (!(val & ~_Q_LOCKED_MASK)) {
		...
	}

	/* *,0,1 -> *,1,1 - we won pending, but there's queueing */
	if (!(val & _Q_PENDING_VAL))
		clear_pending(lock);

	...


Hmmm?
Peter Zijlstra Sept. 20, 2018, 4:22 p.m. UTC | #6
On Thu, Sep 20, 2018 at 06:08:32PM +0200, Peter Zijlstra wrote:
> Another approach might be to use something like:
> 
> 	val = xchg_relaxed(&lock->locked_pending, _Q_PENDING_VAL | _Q_LOCKED_VAL);
> 	val |= atomic_read_acquire(&lock->val) & _Q_TAIL_MASK;
> 
> combined with something like:
> 
> 	/* 0,0,0 -> 0,1,1 - we won trylock */
> 	if (!(val & _Q_LOCKED_MASK)) {

That one doesn't actually work... let me think about this more.

> 		clear_pending(lock);
> 		return;
> 	}
> 
> 	/* 0,0,1 -> 0,1,1 - we won pending */
> 	if (!(val & ~_Q_LOCKED_MASK)) {
> 		...
> 	}
> 
> 	/* *,0,1 -> *,1,1 - we won pending, but there's queueing */
> 	if (!(val & _Q_PENDING_VAL))
> 		clear_pending(lock);
> 
> 	...
> 
> 
> Hmmm?

Patch

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 19261af9f61e..71eb5e3a3d91 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -139,6 +139,20 @@  static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
+/**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ *                             value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
+	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
+	return val;
+}
+
 /*
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
@@ -184,6 +198,18 @@  static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 }
 
 /**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ *                             value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
  * @tail : The new queue tail code word
@@ -289,18 +315,26 @@  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * If we observe any contention; queue.
+	 * If we observe queueing, then queue ourselves.
 	 */
-	if (val & ~_Q_LOCKED_MASK)
+	if (val & _Q_TAIL_MASK)
 		goto queue;
 
 	/*
+	 * We didn't see any queueing, so have one more try at snatching
+	 * the lock in case it became available whilst we were taking the
+	 * slow path.
+	 */
+	if (queued_spin_trylock(lock))
+		return;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = set_pending_fetch_acquire(lock);
 	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
 		 * we're pending, wait for the owner to go away.