Message ID: 1550095217-12047-1-git-send-email-longman@redhat.com
Series: locking/rwsem: Rwsem rearchitecture part 0
On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> v4:
>  - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.
>
> v3:
>  - Optimize __down_read_trylock() for the uncontended case as suggested
>    by Linus.
>
> v2:
>  - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
>  - Update performance test data in patch 1.
>
> The goal of this patchset is to remove the architecture-specific files
> for rwsem-xadd to make it easier to add enhancements in the later rwsem
> patches. It also removes the legacy rwsem-spinlock.c file and makes all
> the architectures use one single implementation of rwsem - rwsem-xadd.c.
>
> Waiman Long (3):
>   locking/rwsem: Remove arch specific rwsem files
>   locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
>   locking/rwsem: Optimize down_read_trylock()

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

with the caveat that I'm happy to exchange patch 3 back to my earlier
suggestion in case Will expresses concerns wrt the ARM64 performance of
Linus' suggestion.
On 02/14/2019 05:37 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
>> [...]
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

I inserted a few lock event counters into the rwsem trylock code:

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	/*
	 * Optimize for the case when the rwsem is not locked at all.
	 */
	long tmp = RWSEM_UNLOCKED_VALUE;

	lockevent_inc(rwsem_rtrylock);
	do {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
					tmp + RWSEM_ACTIVE_READ_BIAS)) {
			rwsem_set_reader_owned(sem);
			return 1;
		}
		lockevent_inc(rwsem_rtrylock_retry);
	} while (tmp >= 0);
	lockevent_inc(rwsem_rtrylock_fail);
	return 0;
}

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
	long tmp;

	lockevent_inc(rwsem_wtrylock);
	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
					  RWSEM_ACTIVE_WRITE_BIAS);
	if (tmp == RWSEM_UNLOCKED_VALUE) {
		rwsem_set_owner(sem);
		return true;
	}
	lockevent_inc(rwsem_wtrylock_fail);
	return false;
}

I booted the new kernel on a 4-socket 56-core 112-thread Broadwell
system. The counter values:

1) After bootup:

rwsem_rtrylock=784029
rwsem_rtrylock_fail=59
rwsem_rtrylock_retry=394
rwsem_wtrylock=18284
rwsem_wtrylock_fail=230

2) After parallel kernel build (-j112):

rwsem_rtrylock=338667559
rwsem_rtrylock_fail=18
rwsem_rtrylock_retry=51
rwsem_wtrylock=17016332
rwsem_wtrylock_fail=98058

At least for these two use cases, try-for-ownership as suggested by
Linus is the right choice.

Cheers,
Longman
On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> > [...]
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

Right, the current proposal doesn't work well for us, unfortunately.
Which was your earlier suggestion?

Will
On 02/15/2019 01:40 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
>> [...]
> Right, the current proposal doesn't work well for us, unfortunately.
> Which was your earlier suggestion?

In my posting yesterday, I showed that most of the trylocks done were
actually uncontended. Assuming that pattern holds for most of the
workloads, it will not be that bad after all.

-Longman
On Fri, Feb 15, 2019 at 01:58:34PM -0500, Waiman Long wrote:
> On 02/15/2019 01:40 PM, Will Deacon wrote:
> > [...]
> In my posting yesterday, I showed that most of the trylocks done were
> actually uncontended. Assuming that pattern holds for most of the
> workloads, it will not be that bad after all.

That's fair enough; if you're going to sit in a tight trylock() loop like
the benchmark does, then you're much better off just calling lock() if you
care at all about scalability.

Will