| Message ID | 20200112235854.32089-1-longman@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [v2] locking/osq: Use optimized spinning loop for arm64 |
Hi Waiman,

On 2020/1/13 7:58, Waiman Long wrote:
> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
> for spinlock that can boost performance of sibling threads by putting
> the current cpu to a shallow sleep state that is woken up only when
> the monitored variable changes or an external event happens.
>
> OSQ has a more complicated spinning loop. Besides the lock value, it
> also checks for need_resched() and vcpu_is_preempted(). The check for
> need_resched() is not a problem as it is only set by the tick interrupt
> handler. That will be detected by the spinning cpu right after iret.
>
> The vcpu_is_preempted() check, however, is a problem as changes to the
> preempt state of the previous node will not affect the sleep state. For
> ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
> against future addition of vcpu_is_preempted() to arm64, code is added
> to cause a build error when vcpu_is_preempted becomes defined in arm64
> without the corresponding changes in the OSQ spinning code.

I am currently working on vcpu_is_preempted() support for arm64; there is
a patch set that does this [1].

[1] https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/

> [...]

Thanks,
Zengruan
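Zengruan's series itself is not shown in this thread, but a paravirt
vcpu_is_preempted() generally works by reading a per-vCPU "preempted" flag
that the hypervisor publishes while the vCPU is scheduled out. The sketch
below is purely hypothetical (the names, layout, and helper are invented for
illustration and are not taken from that series); it shows why such a check
sits uneasily inside an event-based wait: the flag lives in memory other
than the lock word being monitored, so a sleep that only watches
node->locked will not wake when the flag changes.

```c
#include <linux/compiler.h>	/* READ_ONCE() */
#include <linux/percpu.h>	/* DEFINE_PER_CPU(), per_cpu_ptr() */
#include <linux/types.h>

/*
 * Hypothetical sketch only, not code from Zengruan's series: a per-vCPU
 * state block that the hypervisor updates while the vCPU is preempted.
 */
struct pv_vcpu_state {
	u8 preempted;			/* written by the hypervisor */
};

static DEFINE_PER_CPU(struct pv_vcpu_state, pv_vcpu_state);

static inline bool pv_vcpu_is_preempted(int cpu)
{
	/*
	 * This load targets a different location (and cache line) than
	 * node->locked, so a wait that only monitors node->locked cannot
	 * observe it changing until some other event wakes the CPU.
	 */
	return READ_ONCE(per_cpu_ptr(&pv_vcpu_state, cpu)->preempted);
}
```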
[+Marc]

On Sun, Jan 12, 2020 at 06:58:54PM -0500, Waiman Long wrote:
> [...]
>
> +/*
> + * The optimized smp_cond_load_relaxed() spin loop should not be used with
> + * vcpu_is_preempted defined.
> + */
> +#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
> +#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
> +#endif

Although I appreciate you going the extra mile for arm64 (thanks!), I think
this is probably a bit overkill given that I don't plan to merge the series
from Zengruan any time soon.

Instead, how about just defining vcpu_is_preempted in
arch/arm64/include/asm/spinlock.h with a comment:

/*
 * Changing this will break osq_lock() thanks to the call inside
 * smp_cond_load_relaxed().
 *
 * See:
 * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
 */
#define vcpu_is_preempted(cpu)	false

So we'll notice that when somebody tries to change it.

Will
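To make the effect of that suggestion concrete: because the stub expands to
the constant false, the compiler can drop the vcpu_is_preempted() term from
the condition that osq_lock() passes to smp_cond_load_relaxed(), so the
arm64 event-based wait only has to watch node->locked (need_resched() is
still noticed because the tick interrupt wakes the waiting CPU, as the patch
description explains). A rough sketch of the idea, assuming Will's stub;
this is illustration, not code from the thread:

```c
/* Assumed stub, as suggested for arch/arm64/include/asm/spinlock.h. */
#define vcpu_is_preempted(cpu)	false

/*
 * With that stub, the wait added to osq_lock() by the patch:
 *
 *	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
 *				  vcpu_is_preempted(node_cpu(node->prev))))
 *		return true;
 *
 * folds at compile time into the equivalent of:
 *
 *	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched()))
 *		return true;
 *
 * so only node->locked needs to be monitored by the hardware wait.
 */
```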
On 1/13/20 6:57 AM, Will Deacon wrote:
> [+Marc]
>
> On Sun, Jan 12, 2020 at 06:58:54PM -0500, Waiman Long wrote:
>> [...]
>
> Although I appreciate you going the extra mile for arm64 (thanks!), I think
> this is probably a bit overkill given that I don't plan to merge the series
> from Zengruan any time soon. Instead, how about just defining
> vcpu_is_preempted in arch/arm64/include/asm/spinlock.h with a comment:
>
> /*
>  * Changing this will break osq_lock() thanks to the call inside
>  * smp_cond_load_relaxed().
>  *
>  * See:
>  * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
>  */
> #define vcpu_is_preempted(cpu)	false
>
> So we'll notice that when somebody tries to change it.

Yes, that works for me. I just want to make sure that any future change
adding vcpu_is_preempted to arm64 will get caught.

Cheers,
Longman
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 7d9cc5ec4971..8eb5f1239885 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -152,6 +152,16 @@ do {								\
 		VAL;						\
 })
 
+/*
+ * In osq_lock(), smp_cond_load_relaxed() is called with a condition
+ * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
+ * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
+ * the future, smp_cond_load_relaxed() will not respond to changes in the
+ * preempt state in a timely manner. So code changes will have to be made
+ * to address this deficiency.
+ */
+#define vcpu_is_preempted_not_used
+
 #define smp_cond_load_acquire(ptr, cond_expr)			\
 ({								\
 	typeof(ptr) __PTR = (ptr);				\
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..69ec5161c3cc 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,6 +13,14 @@
  */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
 
+/*
+ * The optimized smp_cond_load_relaxed() spin loop should not be used with
+ * vcpu_is_preempted defined.
+ */
+#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
+#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
+#endif
+
 /*
  * We use the value 0 to represent "no CPU", thus the encoded value
  * will be the CPU number incremented by 1.
@@ -134,20 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
 
-	while (!READ_ONCE(node->locked)) {
-		/*
-		 * If we need to reschedule bail... so we can block.
-		 * Use vcpu_is_preempted() to avoid waiting for a preempted
-		 * lock holder:
-		 */
-		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
-			goto unqueue;
-
-		cpu_relax();
-	}
-	return true;
+	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+				  vcpu_is_preempted(node_cpu(node->prev))))
+		return true;
 
-unqueue:
+	/* unqueue */
 	/*
 	 * Step - A -- stabilize @prev
 	 *
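For reference, on architectures that do not provide their own
smp_cond_load_relaxed(), the portable fallback simply spins with cpu_relax()
and re-evaluates the caller's whole condition expression on every iteration,
so the rewritten osq_lock() wait behaves just like the old open-coded loop
there. The macro below is a simplified sketch of that asm-generic fallback,
not the exact kernel source:

```c
#include <linux/compiler.h>	/* READ_ONCE() */
#include <asm/processor.h>	/* cpu_relax() */

/*
 * Simplified sketch of the generic fallback: load the variable, test the
 * caller's condition (which may include need_resched(), vcpu_is_preempted()
 * and so on), and spin with cpu_relax() until the condition becomes true.
 */
#ifndef smp_cond_load_relaxed
#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
	typeof(ptr) __PTR = (ptr);				\
	typeof(*ptr) VAL;					\
	for (;;) {						\
		VAL = READ_ONCE(*__PTR);			\
		if (cond_expr)					\
			break;					\
		cpu_relax();					\
	}							\
	VAL;							\
})
#endif
```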
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire) for
spinlock that can boost performance of sibling threads by putting the
current cpu to a shallow sleep state that is woken up only when the
monitored variable changes or an external event happens.

OSQ has a more complicated spinning loop. Besides the lock value, it also
checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after iret.

The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of the previous node will not affect the sleep state. For
ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard against
future addition of vcpu_is_preempted() to arm64, code is added to cause a
build error when vcpu_is_preempted becomes defined in arm64 without the
corresponding changes in the OSQ spinning code.

On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The performance
numbers before the patch were:

  Running locktest with mutex [runtime = 10s, load = 1]
  Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
  Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After the patch, the numbers were:

  Running locktest with mutex [runtime = 10s, load = 1]
  Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
  Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about a 20% performance improvement.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/arm64/include/asm/barrier.h | 10 ++++++++++
 kernel/locking/osq_lock.c        | 25 ++++++++++++-------------
 2 files changed, 22 insertions(+), 13 deletions(-)
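The "shallow sleep state" described above comes from arm64 providing its own
smp_cond_load_relaxed()/smp_cond_load_acquire() (which
atomic_cond_read_acquire() builds on): instead of spinning with cpu_relax(),
it waits with an exclusive-monitor load plus WFE, so the CPU sleeps until
the monitored location changes or another event (an interrupt, or the
periodic event stream) arrives. The macro below is a rough, simplified
sketch of that arm64 flavour, not the exact contents of
arch/arm64/include/asm/barrier.h; it also shows why a condition term such as
vcpu_is_preempted() is only re-evaluated after such a wakeup rather than
continuously:

```c
/*
 * Rough sketch of the arm64-style implementation: if the condition is not
 * yet true, __cmpwait_relaxed() arms the exclusive monitor on *__PTR and
 * executes WFE, putting the CPU into a light sleep until that location is
 * written or another wakeup event occurs; only then is cond_expr checked
 * again.
 */
#define smp_cond_load_relaxed(ptr, cond_expr)			\
({								\
	typeof(ptr) __PTR = (ptr);				\
	typeof(*ptr) VAL;					\
	for (;;) {						\
		VAL = READ_ONCE(*__PTR);			\
		if (cond_expr)					\
			break;					\
		__cmpwait_relaxed(__PTR, VAL);	/* LDXR + WFE */\
	}							\
	VAL;							\
})
```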