Message ID | 1550089932-6888-3-git-send-email-longman@redhat.com (mailing list archive)
---|---
State | New, archived
Series | locking/rwsem: Remove arch specific rwsem files
On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
>
> [...]
>
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on a x86-64 system
> before and after the patch were:
>
>                 Before Patch   After Patch
>   # of Threads     rlock          rlock
>   ------------     -----          -----
>        1          14,496         14,716
>        2           8,644          8,453
>        4           6,799          6,983
>        8           5,664          7,190
>
> On a ARM64 system, the performance results were:
>
>                 Before Patch   After Patch
>   # of Threads     rlock          rlock
>   ------------     -----          -----
>        1          23,676         24,488
>        2           7,697          9,502
>        4           4,945          3,440
>        8           2,641          1,603

Urgh, yes LL/SC is the obvious exception that can actually do better
here :/

Will, what say you?
On 02/14/2019 05:33 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> [...]
>
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

The ARM64 result is what I would have expected given that the change
was to optimize for the uncontended case. The x86-64 result is kind of
an anomaly to me, but I haven't bothered to dig into that.

Cheers,
Longman
On Thu, Feb 14, 2019 at 6:53 AM Waiman Long <longman@redhat.com> wrote:
>
> The ARM64 result is what I would have expected given that the change
> was to optimize for the uncontended case. The x86-64 result is kind of
> an anomaly to me, but I haven't bothered to dig into that.

I would say that the ARM result is what I'd expect from something that
scales badly to begin with.

The x86-64 result is the expected one: yes, the cmpxchg is done one
extra time, but it results in fewer cache transitions (the cacheline
never goes into "shared" state), and cache transitions are what matter.

The cost of re-doing the instruction should be low. The cacheline
ping-pong and the cache coherency messages are what hurt.

So I actually think both are very easily explained.

The x86-64 number improves, because there is less cache coherency
traffic.

The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" to the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.

The cachelines not sticking at all to a core probably is good for
fairness issues (in particular, sticking *too* much can cause horrible
issues), but it's absolutely horrible if it means that you lose the
cacheline even before you get to complete the second cmpxchg.

                Linus
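To make the cache-coherency argument concrete, here is a minimal user-space
sketch of the two trylock shapes being compared. It uses C11 atomics as a
stand-in for the kernel's atomic_long_* API; READ_BIAS and the unlocked value
0 are placeholders for RWSEM_ACTIVE_READ_BIAS and RWSEM_UNLOCKED_VALUE. This
illustrates the access pattern only, it is not the kernel code itself.

#include <stdatomic.h>
#include <stdbool.h>

#define READ_BIAS 1L	/* placeholder for RWSEM_ACTIVE_READ_BIAS */

/*
 * Old shape: plain load first, cmpxchg second.  The load pulls the
 * cacheline in shared state; the following cmpxchg must then upgrade it
 * to exclusive, so even an uncontended acquire causes two coherency
 * transactions.
 */
static bool trylock_load_then_cas(atomic_long *count)
{
	long old;

	while ((old = atomic_load_explicit(count, memory_order_relaxed)) >= 0) {
		long expected = old;

		if (atomic_compare_exchange_strong_explicit(count, &expected,
				old + READ_BIAS, memory_order_acquire,
				memory_order_relaxed))
			return true;
	}
	return false;
}

/*
 * New shape: optimistically cmpxchg against the unlocked value (0 here).
 * The line is requested for exclusive ownership right away, so the
 * uncontended case is a single coherency transaction, at the price of a
 * guaranteed first failure whenever the rwsem is already held for read.
 */
static bool trylock_cas_first(atomic_long *count)
{
	long old = 0;	/* placeholder for RWSEM_UNLOCKED_VALUE */

	do {
		/* on failure, 'old' is refreshed with the observed count */
		if (atomic_compare_exchange_strong_explicit(count, &old,
				old + READ_BIAS, memory_order_acquire,
				memory_order_relaxed))
			return true;
	} while (old >= 0);
	return false;
}

The second shape trades a guaranteed first-cmpxchg failure in the
reader-held case for a single exclusive-ownership request in the
uncontended case, which is the trade-off visible in the x86-64 and ARM64
numbers above.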
On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
> > Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> > it generate slightly better code.
> >
> > [...]
> >
> > On a ARM64 system, the performance results were:
> >
> >                 Before Patch   After Patch
> >   # of Threads     rlock          rlock
> >   ------------     -----          -----
> >        1          23,676         24,488
> >        2           7,697          9,502
> >        4           4,945          3,440
> >        8           2,641          1,603
>
> Urgh, yes LL/SC is the obvious exception that can actually do better
> here :/
>
> Will, what say you?

What machine were these numbers generated on and is it using LL/SC or LSE
atomics for arm64? If you stick the microbenchmark somewhere, I can go play
with a broader variety of h/w.

Will
On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The arm64 numbers scaled horribly even before, and that's because
> there is too much ping-pong, and it's probably because there is no
> "stickiness" to the cacheline to the core, and thus adding the extra
> loop can make the ping-pong issue even worse because now there is more
> of it.

Actually, if it's using the ll/sc, then I don't see why arm64 should
even change. It doesn't really even change the pattern: the initial
load of the value is just replaced with a "ll" that gets a non-zero
value, and then we re-try without even doing the "sc" part. End
result: exact same "load once, then do ll/sc to update". Just using a
slightly different instruction pattern.

But maybe "ll" does something different to the cacheline than a regular "ld"?

Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.

So I take it back. I'm actually surprised that arm64 performs worse. I
don't think it should. But numbers talk, bullshit walks, and it clearly
does make for worse numbers on arm64.

                Linus
On 02/14/2019 01:02 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:33:33AM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 13, 2019 at 03:32:12PM -0500, Waiman Long wrote:
>>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>>> it generate slightly better code.
>>>
>>> [...]
>>
>> Urgh, yes LL/SC is the obvious exception that can actually do better
>> here :/
>>
>> Will, what say you?
> What machine were these numbers generated on and is it using LL/SC or LSE
> atomics for arm64? If you stick the microbenchmark somewhere, I can go play
> with a broader variety of h/w.
>
> Will

The machine is a 2-socket Cavium ThunderX2 99xx system with 64 cores and
256 threads. I was just using threads from the first socket for this
test. The microbenchmark that I used is attached.

I used the command "./run-locktest -ltryrwsem -r100 -i-10 -c10 -n<threads>"
to generate the locking rates. The lscpu flags were:

  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

Cheers,
Longman
On Thu, Feb 14, 2019 at 10:09:44AM -0800, Linus Torvalds wrote:
> On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The arm64 numbers scaled horribly even before, and that's because
> > there is too much ping-pong, and it's probably because there is no
> > "stickiness" to the cacheline to the core, and thus adding the extra
> > loop can make the ping-pong issue even worse because now there is more
> > of it.
>
> Actually, if it's using the ll/sc, then I don't see why arm64 should
> even change. It doesn't really even change the pattern: the initial
> load of the value is just replaced with a "ll" that gets a non-zero
> value, and then we re-try without even doing the "sc" part.

So our cmpxchg() has a prefetch-with-intent-to-modify instruction before
the 'll' part, which will attempt to grab the line unique the first time
round. The 'll' also has acquire semantics, so there's the chance for the
micro-architecture to handle that badly too.

I think that the problem with the proposed change is that whenever a
reader tries to acquire an rwsem that is already held for read, it will
always fail the first cmpxchg(), so in this situation the read path is
considerably slower than before.

> End result: exact same "load once, then do ll/sc to update". Just
> using a slightly different instruction pattern.
>
> But maybe "ll" does something different to the cacheline than a regular "ld"?
>
> Alternatively, the machine you used is using LSE, and the "swp" thing
> has some horrid behavior when it fails.

Depending on where the data is, the LSE instructions may execute outside
of the CPU (e.g. in a cache controller) and so could add latency to a
failing CAS.

Will
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..e0bcc11 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -175,14 +175,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-					tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
-	}
+	} while (tmp >= 0);
 	return 0;
 }
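The do/while above leans on the semantics of atomic_long_try_cmpxchg_acquire():
unlike the atomic_long_cmpxchg_acquire() it replaces, it returns a
success/failure boolean and, on failure, writes the value it actually observed
back through &tmp, which is why "while (tmp >= 0)" can re-test the count
without issuing another load. A rough sketch of that contract, expressed in
terms of the plain cmpxchg (my_try_cmpxchg_acquire is just an illustrative
name mirroring the kernel's generic fallback, not any per-arch implementation):

#include <linux/atomic.h>
#include <linux/types.h>

static inline bool my_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
{
	long prev = atomic_long_cmpxchg_acquire(v, *old, new);

	if (prev == *old)
		return true;	/* exchange performed, acquire ordering applies */
	*old = prev;		/* report the count that was actually seen */
	return false;
}

With plain cmpxchg() the caller has to compare the returned value against its
expected value and then reload before retrying; with the try_ variant the
failure path hands back the fresh value, which is what lets the new loop drop
the explicit atomic_long_read() that the old while loop needed.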
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
   0x0000000000000005 <+5>:     jmp    0x18 <down_read_trylock+24>
   0x0000000000000007 <+7>:     lea    0x1(%rdx),%rcx
   0x000000000000000b <+11>:    mov    %rdx,%rax
   0x000000000000000e <+14>:    lock cmpxchg %rcx,(%rdi)
   0x0000000000000013 <+19>:    cmp    %rax,%rdx
   0x0000000000000016 <+22>:    je     0x23 <down_read_trylock+35>
   0x0000000000000018 <+24>:    mov    (%rdi),%rdx
   0x000000000000001b <+27>:    test   %rdx,%rdx
   0x000000000000001e <+30>:    jns    0x7 <down_read_trylock+7>
   0x0000000000000020 <+32>:    xor    %eax,%eax
   0x0000000000000022 <+34>:    retq
   0x0000000000000023 <+35>:    mov    %gs:0x0,%rax
   0x000000000000002c <+44>:    or     $0x3,%rax
   0x0000000000000030 <+48>:    mov    %rax,0x20(%rdi)
   0x0000000000000034 <+52>:    mov    $0x1,%eax
   0x0000000000000039 <+57>:    retq

After patch, down_read_trylock:

   0x0000000000000000 <+0>:     callq  0x5 <down_read_trylock+5>
   0x0000000000000005 <+5>:     xor    %eax,%eax
   0x0000000000000007 <+7>:     lea    0x1(%rax),%rdx
   0x000000000000000b <+11>:    lock cmpxchg %rdx,(%rdi)
   0x0000000000000010 <+16>:    jne    0x29 <down_read_trylock+41>
   0x0000000000000012 <+18>:    mov    %gs:0x0,%rax
   0x000000000000001b <+27>:    or     $0x3,%rax
   0x000000000000001f <+31>:    mov    %rax,0x20(%rdi)
   0x0000000000000023 <+35>:    mov    $0x1,%eax
   0x0000000000000028 <+40>:    retq
   0x0000000000000029 <+41>:    test   %rax,%rax
   0x000000000000002c <+44>:    jns    0x7 <down_read_trylock+7>
   0x000000000000002e <+46>:    xor    %eax,%eax
   0x0000000000000030 <+48>:    retq

By using a rwsem microbenchmark, the down_read_trylock() rate (with a
load of 10 to lengthen the lock critical section) on a x86-64 system
before and after the patch were:

                Before Patch   After Patch
  # of Threads     rlock          rlock
  ------------     -----          -----
       1          14,496         14,716
       2           8,644          8,453
       4           6,799          6,983
       8           5,664          7,190

On a ARM64 system, the performance results were:

                Before Patch   After Patch
  # of Threads     rlock          rlock
  ------------     -----          -----
       1          23,676         24,488
       2           7,697          9,502
       4           4,945          3,440
       8           2,641          1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
performs pretty well on x86-64, but performance degrades at high
contention levels on ARM64.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)