[1/2] qspinlock: Ensure writes are pushed out of core write buffer

Message ID 20210127200109.16412-1-alexander.sverdlin@nokia.com (mailing list archive)
State New, archived
Series [1/2] qspinlock: Ensure writes are pushed out of core write buffer

Commit Message

Alexander Sverdlin Jan. 27, 2021, 8:01 p.m. UTC
From: Alexander Sverdlin <alexander.sverdlin@nokia.com>

Ensure writes are pushed out of the core write buffer to prevent waiting code
on other cores from spinning longer than necessary.

6 threads running tight spinlock loop competing for the same lock
on 6 cores on MIPS/Octeon do 1000000 iterations...

before the patch in:	4.3 sec
after the patch in:	1.2 sec
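
For illustration, numbers like these can be collected with a small test
module along the following lines (a hypothetical sketch, not the actual
test used for the figures above; the names and the fixed iteration count
are made up for the example, and error handling is omitted):

#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/kthread.h>
#include <linux/ktime.h>
#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(test_lock);
static DECLARE_COMPLETION(all_done);
static atomic_t remaining;

/* One hammering thread per online CPU, all competing for the same lock. */
static int lock_hammer(void *unused)
{
        int i;

        for (i = 0; i < 1000000; i++) {
                spin_lock(&test_lock);
                spin_unlock(&test_lock);
        }
        if (atomic_dec_and_test(&remaining))
                complete(&all_done);
        return 0;
}

static int __init lock_bench_init(void)
{
        ktime_t start = ktime_get();
        struct task_struct *t;
        int cpu;

        atomic_set(&remaining, num_online_cpus());
        for_each_online_cpu(cpu) {
                t = kthread_create(lock_hammer, NULL, "lock_hammer/%d", cpu);
                kthread_bind(t, cpu);
                wake_up_process(t);
        }
        wait_for_completion(&all_done);
        pr_info("lock_bench: %lld ms\n", ktime_ms_delta(ktime_get(), start));
        return 0;
}
module_init(lock_bench_init);

static void __exit lock_bench_exit(void)
{
}
module_exit(lock_bench_exit);

MODULE_LICENSE("GPL");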

Same 6-core Octeon machine:
sysbench --test=mutex --num-threads=64 --memory-scope=local run

w/o patch:	1.53s
with patch:	1.28s

This will also allow removing the smp_wmb() in
arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
issue?).

Finally, our internal, quite diverse test suite covering different IPC/network
aspects didn't detect any regressions on ARM/ARM64/x86_64.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
---
 kernel/locking/mcs_spinlock.h | 5 +++++
 kernel/locking/qspinlock.c    | 6 ++++++
 2 files changed, 11 insertions(+)

Comments

Will Deacon Jan. 27, 2021, 10:21 p.m. UTC | #1
On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Ensure writes are pushed out of the core write buffer to prevent waiting code
> on other cores from spinning longer than necessary.
> 
> 6 threads running tight spinlock loop competing for the same lock
> on 6 cores on MIPS/Octeon do 1000000 iterations...
> 
> before the patch in:	4.3 sec
> after the patch in:	1.2 sec

If you only have 6 cores, I'm not sure qspinlock makes any sense...

> Same 6-core Octeon machine:
> sysbench --test=mutex --num-threads=64 --memory-scope=local run
> 
> w/o patch:	1.53s
> with patch:	1.28s
> 
> This will also allow removing the smp_wmb() in
> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
> issue?).
> 
> Finally, our internal, quite diverse test suite covering different IPC/network
> aspects didn't detect any regressions on ARM/ARM64/x86_64.
> 
> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> ---
>  kernel/locking/mcs_spinlock.h | 5 +++++
>  kernel/locking/qspinlock.c    | 6 ++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>  		return;
>  	}
>  	WRITE_ONCE(prev->next, node);
> +	/*
> +	 * This is necessary to make sure that the corresponding "while" in the
> +	 * mcs_spin_unlock() doesn't loop forever
> +	 */
> +	smp_wmb();

If it loops forever, that's broken hardware design; store buffers need to
drain. I don't think we should add unconditional barriers to bodge this.

>  	/* Wait until the lock holder passes the lock down. */
>  	arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
> +		/*
> +		 * This is necessary to make sure that the corresponding
> +		 * smp_cond_load_relaxed() below (running on another core)
> +		 * doesn't spin forever.
> +		 */
> +		smp_wmb();

Likewise.

Will

Peter Zijlstra Jan. 27, 2021, 10:43 p.m. UTC | #2
On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Ensure writes are pushed out of the core write buffer to prevent waiting code
> on other cores from spinning longer than necessary.

Our smp_wmb() as defined does not have that property. You're relying on
some arch specific details which do not belong in common code.
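
To make the pairing concrete, the handoff in question boils down to the
pattern below (a simplified userspace analogue in C11 atomics, purely for
illustration, not the kernel code): the new waiter publishes itself with a
plain store and the predecessor spins reading it. A write barrier placed
after that store only orders it against the waiter's *later* stores; it
makes no promise about how quickly the store propagates to the other core.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_int locked;
};

static struct mcs_node prev_node, self_node;

/* New waiter: publish ourselves, then spin for the handoff. */
static void *waiter(void *arg)
{
        atomic_store_explicit(&prev_node.next, &self_node,
                              memory_order_relaxed);
        /*
         * The proposed smp_wmb() would go here.  It constrains the order of
         * the store above relative to later stores from this thread; it is
         * not required to flush anything.
         */
        while (!atomic_load_explicit(&self_node.locked, memory_order_acquire))
                ;
        return NULL;
}

/* Predecessor: wait for a successor to appear, then pass the lock on. */
static void *predecessor(void *arg)
{
        struct mcs_node *next;

        while (!(next = atomic_load_explicit(&prev_node.next,
                                             memory_order_relaxed)))
                ;
        atomic_store_explicit(&next->locked, 1, memory_order_release);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, waiter, NULL);
        pthread_create(&b, NULL, predecessor, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("handoff completed");
        return 0;
}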


> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>  		return;
>  	}
>  	WRITE_ONCE(prev->next, node);
> +	/*
> +	 * This is necessary to make sure that the corresponding "while" in the
> +	 * mcs_spin_unlock() doesn't loop forever
> +	 */

This comment is utterly inadequate, since it does not describe an
explicit ordering between two (or more) stores.

> +	smp_wmb();
>  
>  	/* Wait until the lock holder passes the lock down. */
>  	arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
> +		/*
> +		 * This is necessary to make sure that the corresponding
> +		 * smp_cond_load_relaxed() below (running on another core)
> +		 * doesn't spin forever.
> +		 */
> +		smp_wmb();

That's insane, cache coherency should not allow that to happen in the
first place. Our smp_wmb() cannot help with that.

>  		pv_wait_node(node, prev);
>  		arch_mcs_spin_lock_contended(&node->locked);

Alexander Sverdlin Jan. 28, 2021, 7:36 a.m. UTC | #3
Hi!

On 27/01/2021 23:21, Will Deacon wrote:
> On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
>> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>>
>> Ensure writes are pushed out of the core write buffer to prevent waiting code
>> on other cores from spinning longer than necessary.
>>
>> 6 threads running tight spinlock loop competing for the same lock
>> on 6 cores on MIPS/Octeon do 1000000 iterations...
>>
>> before the patch in:	4.3 sec
>> after the patch in:	1.2 sec
> If you only have 6 cores, I'm not sure qspinlock makes any sense...

That's my impression as well and I even proposed to revert qspinlocks for MIPS:
https://lore.kernel.org/linux-mips/20170610002644.8434-1-paul.burton@imgtec.com/T/#ma1506c80472129b2ac41554cc2d863c9a24518c0

>> Same 6-core Octeon machine:
>> sysbench --test=mutex --num-threads=64 --memory-scope=local run
>>
>> w/o patch:	1.53s
>> with patch:	1.28s
>>
>> This will also allow removing the smp_wmb() in
>> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
>> issue?).
>>
>> Finally, our internal, quite diverse test suite covering different IPC/network
>> aspects didn't detect any regressions on ARM/ARM64/x86_64.
>>
>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>> ---
>>  kernel/locking/mcs_spinlock.h | 5 +++++
>>  kernel/locking/qspinlock.c    | 6 ++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 5e10153..10e497a 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>>  		return;
>>  	}
>>  	WRITE_ONCE(prev->next, node);
>> +	/*
>> +	 * This is necessary to make sure that the corresponding "while" in the
>> +	 * mcs_spin_unlock() doesn't loop forever
>> +	 */
>> +	smp_wmb();
> If it loops forever, that's broken hardware design; store buffers need to
> drain. I don't think we should add unconditional barriers to bodge this.

The comment exaggerates the situation a bit, but the behaviour is non-deterministic and
you can see the measurements above. Something is wrong in the qspinlock code; please
consider this patch "RFC", but something has to be done here.

Alexander Sverdlin Jan. 28, 2021, 7:42 a.m. UTC | #4
Hi!

On 27/01/2021 23:43, Peter Zijlstra wrote:
> On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
>> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>>
>> Ensure writes are pushed out of the core write buffer to prevent waiting code
>> on other cores from spinning longer than necessary.
> Our smp_wmb() as defined does not have that property. You're relying on
> some arch specific details which do not belong in common code.

Yes, my intention was SYNCW on Octeon, which happens to be smp_wmb().
Do you think that the core write buffer is an Octeon-only feature and there
will be no others?

Should I re-implement arch_mcs_spin_lock_contended() for Octeon only,
as has been done for ARM?
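
For the sake of discussion, an Octeon-only override along the lines of the
ARM header could look roughly like the sketch below (hypothetical, not an
existing kernel file; whether the barrier belongs there at all is exactly
the open question). Since both kernel/locking/mcs_spinlock.h and the
qspinlock slowpath end up in arch_mcs_spin_lock_contended(), an override
there would presumably cover both call sites without touching generic code:

/* SPDX-License-Identifier: GPL-2.0 */
/* Hypothetical arch/mips/include/asm/mcs_spinlock.h, sketch only. */
#ifndef __ASM_MCS_SPINLOCK_H
#define __ASM_MCS_SPINLOCK_H

/*
 * On Octeon, smp_wmb() expands to SYNCW, which drains the core's write
 * buffer, so the WRITE_ONCE(prev->next, node) done just before the
 * contended spin becomes visible to the predecessor promptly.
 */
#define arch_mcs_spin_lock_contended(l)                                 \
do {                                                                    \
        smp_wmb();                                                      \
        smp_cond_load_acquire(l, VAL);                                  \
} while (0)

#endif /* __ASM_MCS_SPINLOCK_H */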

>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 5e10153..10e497a 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>>  		return;
>>  	}
>>  	WRITE_ONCE(prev->next, node);
>> +	/*
>> +	 * This is necessary to make sure that the corresponding "while" in the
>> +	 * mcs_spin_unlock() doesn't loop forever
>> +	 */
> This comment is utterly inadequate, since it does not describe an
> explicit ordering between two (or more) stores.
> 
>> +	smp_wmb();
>>  
>>  	/* Wait until the lock holder passes the lock down. */
>>  	arch_mcs_spin_lock_contended(&node->locked);
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index cbff6ba..577fe01 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  
>>  		/* Link @node into the waitqueue. */
>>  		WRITE_ONCE(prev->next, node);
>> +		/*
>> +		 * This is necessary to make sure that the corresponding
>> +		 * smp_cond_load_relaxed() below (running on another core)
>> +		 * doesn't spin forever.
>> +		 */
>> +		smp_wmb();
> That's insane, cache coherency should not allow that to happen in the
> first place. Our smp_wmb() cannot help with that.
>

Peter Zijlstra Jan. 28, 2021, 11:24 a.m. UTC | #5
On Thu, Jan 28, 2021 at 08:36:24AM +0100, Alexander Sverdlin wrote:

> >> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> >> index 5e10153..10e497a 100644
> >> --- a/kernel/locking/mcs_spinlock.h
> >> +++ b/kernel/locking/mcs_spinlock.h
> >> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> >>  		return;
> >>  	}
> >>  	WRITE_ONCE(prev->next, node);
> >> +	/*
> >> +	 * This is necessary to make sure that the corresponding "while" in the
> >> +	 * mcs_spin_unlock() doesn't loop forever
> >> +	 */
> >> +	smp_wmb();
> > If it loops forever, that's broken hardware design; store buffers need to
> > drain. I don't think we should add unconditional barriers to bodge this.
> 
> The comment exaggerates the situation a bit, but the behaviour is non-deterministic and
> you can see the measurements above. Something is wrong in the qspinlock code; please
> consider this patch "RFC", but something has to be done here.

The qspinlock code has been TLA+ modelled and has had extensive memory
ordering analysis. It has had lots of runtime on extremely large x86,
arm64, and Power machines. I'm fairly confident there is nothing wrong.

What I do think is more likely is that your platform is broken; it
wouldn't be the first MIPS that's got store buffers completely wrong,
see commit:

  a30718868915 ("MIPS: Change definition of cpu_relax() for Loongson-3")
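
For reference, the workaround in that commit was confined to arch code,
roughly the following in arch/mips/include/asm/processor.h (paraphrased
from memory, so treat the exact config symbol and wording as approximate):

#ifdef CONFIG_CPU_LOONGSON3
/*
 * Loongson-3's store-fill-buffer can hold writes back indefinitely while a
 * tight read loop runs, so cpu_relax(), which every spin loop is expected
 * to call, forces a flush.
 */
#define cpu_relax()     smp_mb()
#else
#define cpu_relax()     barrier()
#endif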

Working around micro arch store-buffer issues is not something the
generic code is for.

Patch

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153..10e497a 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -89,6 +89,11 @@  void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 		return;
 	}
 	WRITE_ONCE(prev->next, node);
+	/*
+	 * This is necessary to make sure that the corresponding "while" in the
+	 * mcs_spin_unlock() doesn't loop forever
+	 */
+	smp_wmb();
 
 	/* Wait until the lock holder passes the lock down. */
 	arch_mcs_spin_lock_contended(&node->locked);
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba..577fe01 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -469,6 +469,12 @@  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 		/* Link @node into the waitqueue. */
 		WRITE_ONCE(prev->next, node);
+		/*
+		 * This is necessary to make sure that the corresponding
+		 * smp_cond_load_relaxed() below (running on another core)
+		 * doesn't spin forever.
+		 */
+		smp_wmb();
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);