Message ID | 20200113150735.21956-1-longman@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] locking/osq: Use optimized spinning loop for arm64 | expand |
On Mon, Jan 13, 2020 at 10:07:35AM -0500, Waiman Long wrote: > Arm64 has a more optimized spinning loop (atomic_cond_read_acquire) > using wfe for spinlock that can boost performance of sibling threads > by putting the current cpu to a wait state that is broken only when > the monitored variable changes or an external event happens. > > OSQ has a more complicated spinning loop. Besides the lock value, it > also checks for need_resched() and vcpu_is_preempted(). The check for > need_resched() is not a problem as it is only set by the tick interrupt > handler. That will be detected by the spinning cpu right after iret. > > The vcpu_is_preempted() check, however, is a problem as changes to the > preempt state of of previous node will not affect the wait state. For > ARM64, vcpu_is_preempted is not currently defined and so is a no-op. > Will has indicated that he is planning to para-virtualize wfe instead > of defining vcpu_is_preempted for PV support. So just add a comment in > arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted() > should not be defined as suggested. > > On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking > microbenchmark was run for 10s with and without the patch. The > performance numbers before patch were: > > Running locktest with mutex [runtime = 10s, load = 1] > Threads = 224, Min/Mean/Max = 316/123,143/2,121,269 > Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s > > After patch, the numbers were: > > Running locktest with mutex [runtime = 10s, load = 1] > Threads = 224, Min/Mean/Max = 334/147,836/1,304,787 > Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s > > So there was about 20% performance improvement. > > Signed-off-by: Waiman Long <longman@redhat.com> > --- > arch/arm64/include/asm/spinlock.h | 9 +++++++++ > kernel/locking/osq_lock.c | 17 ++++------------- > 2 files changed, 13 insertions(+), 13 deletions(-) Acked-by: Will Deacon <will@kernel.org> Thanks, Will
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h index b093b287babf..102404dc1e13 100644 --- a/arch/arm64/include/asm/spinlock.h +++ b/arch/arm64/include/asm/spinlock.h @@ -11,4 +11,13 @@ /* See include/linux/spinlock.h */ #define smp_mb__after_spinlock() smp_mb() +/* + * Changing this will break osq_lock() thanks to the call inside + * smp_cond_load_relaxed(). + * + * See: + * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net + */ +#define vcpu_is_preempted(cpu) false + #endif /* __ASM_SPINLOCK_H */ diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 6ef600aa0f47..e42893f2fcbc 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -134,20 +134,11 @@ bool osq_lock(struct optimistic_spin_queue *lock) * cmpxchg in an attempt to undo our queueing. */ - while (!READ_ONCE(node->locked)) { - /* - * If we need to reschedule bail... so we can block. - * Use vcpu_is_preempted() to avoid waiting for a preempted - * lock holder: - */ - if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) - goto unqueue; - - cpu_relax(); - } - return true; + if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || + vcpu_is_preempted(node_cpu(node->prev)))) + return true; -unqueue: + /* unqueue */ /* * Step - A -- stabilize @prev *
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire) using wfe for spinlock that can boost performance of sibling threads by putting the current cpu to a wait state that is broken only when the monitored variable changes or an external event happens. OSQ has a more complicated spinning loop. Besides the lock value, it also checks for need_resched() and vcpu_is_preempted(). The check for need_resched() is not a problem as it is only set by the tick interrupt handler. That will be detected by the spinning cpu right after iret. The vcpu_is_preempted() check, however, is a problem as changes to the preempt state of of previous node will not affect the wait state. For ARM64, vcpu_is_preempted is not currently defined and so is a no-op. Will has indicated that he is planning to para-virtualize wfe instead of defining vcpu_is_preempted for PV support. So just add a comment in arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted() should not be defined as suggested. On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking microbenchmark was run for 10s with and without the patch. The performance numbers before patch were: Running locktest with mutex [runtime = 10s, load = 1] Threads = 224, Min/Mean/Max = 316/123,143/2,121,269 Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s After patch, the numbers were: Running locktest with mutex [runtime = 10s, load = 1] Threads = 224, Min/Mean/Max = 334/147,836/1,304,787 Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s So there was about 20% performance improvement. Signed-off-by: Waiman Long <longman@redhat.com> --- arch/arm64/include/asm/spinlock.h | 9 +++++++++ kernel/locking/osq_lock.c | 17 ++++------------- 2 files changed, 13 insertions(+), 13 deletions(-)