Message ID | 1549566446-27967-1-git-send-email-longman@redhat.com (mailing list archive) |
---|---|
Series | locking/rwsem: Rework rwsem-xadd & enable new rwsem features |
On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
> On 32-bit architectures, there aren't enough bits to hold both.
> 64-bit architectures, however, can have enough bits to do that. For
> x86-64, the physical address can use up to 52 bits. That is 4PB of
> memory. That leaves 12 bits available for other use. The task structure
> pointer is also aligned to the L1 cache size. That means another 6 bits
> (64 bytes cacheline) will be available. Reserving 2 bits for status
> flags, we will have 16 bits for the reader count. That can support
> up to (64k-1) readers.

*groan*...

So take qrwlock's idea for a queue, then make the count value (similar
to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
handoff bit etc..

That should fit and work on 32bit and 64bit without issue.

I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
never got around to doing all the optimistic spin and steal crap that
makes our current rwsem fly.

And that nicely gets rid of that mind bending BIAS crud.
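The count layout proposed in this message can be sketched roughly as follows. This is only an illustration: the macro and function names are hypothetical, and the sketch assumes the owner pointer is at least 64-byte aligned so that its low 6 bits are free for flags. Only the bit positions (bit0 r/w, bit1 pending, bit2 handoff, bits 6-N payload) are taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the message above. */
#define CNT_WRITER      ((uintptr_t)1 << 0)   /* bit0: set = write-locked, clear = read mode */
#define CNT_PENDING     ((uintptr_t)1 << 1)   /* bit1: pending */
#define CNT_HANDOFF     ((uintptr_t)1 << 2)   /* bit2: handoff */
#define CNT_SHIFT       6                     /* payload lives in bits 6..N */
#define CNT_LOW_MASK    (((uintptr_t)1 << CNT_SHIFT) - 1)
#define CNT_READER_UNIT ((uintptr_t)1 << CNT_SHIFT)

/* Build a write-locked count word from an owner pointer value.
 * Assumption: the pointer is 64-byte aligned, so bits 0-5 are zero. */
static inline uintptr_t count_make_writer(uintptr_t owner)
{
    assert((owner & CNT_LOW_MASK) == 0);
    return owner | CNT_WRITER;
}

static inline bool count_is_write_locked(uintptr_t c)
{
    return (c & CNT_WRITER) != 0;
}

/* When write-locked, bits 6..N hold the owner; otherwise there is no owner. */
static inline uintptr_t count_owner(uintptr_t c)
{
    return count_is_write_locked(c) ? (c & ~CNT_LOW_MASK) : 0;
}

/* When not write-locked, bits 6..N hold the number of readers. */
static inline uintptr_t count_readers(uintptr_t c)
{
    return count_is_write_locked(c) ? 0 : (c >> CNT_SHIFT);
}
```

Because `uintptr_t` follows the native word size, the same layout fits on both 32-bit and 64-bit builds, which is the property the message claims for this scheme. The cost is that the payload bits mean two different things depending on bit0.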
On Thu, 07 Feb 2019, Waiman Long wrote:
> 30 files changed, 1197 insertions(+), 1594 deletions(-)
Performance numbers on numerous workloads, pretty please.
I'll go and throw this at my mmap_sem intensive workloads
I've collected.
Thanks,
Davidlohr
On 02/07/2019 02:45 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>> On 32-bit architectures, there aren't enough bits to hold both.
>> 64-bit architectures, however, can have enough bits to do that. For
>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>> memory. That leaves 12 bits available for other use. The task structure
>> pointer is also aligned to the L1 cache size. That means another 6 bits
>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>> flags, we will have 16 bits for the reader count. That can support
>> up to (64k-1) readers.
> *groan*...
>
> So take qrwlock's idea for a queue, then make the count value (similar
> to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
> owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
> handoff bit etc..
>
> That should fit and work on 32bit and 64bit without issue.
>
> I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
> never got around to doing all the optimistic spin and steal crap that
> makes our current rwsem fly.
>
> And that nicely gets rid of that mind bending BIAS crud.

Well, the reason for this compromise is to keep using xadd for readers.
Your scheme will certainly work, but we have to use cmpxchg for readers
too. That will have a performance impact especially with multiple
readers contending which I am trying to avoid.

Cheers,
Longman
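The xadd versus cmpxchg trade-off mentioned in this reply can be illustrated in userspace with C11 atomics. This is a minimal sketch, not the kernel code: `WRITER_BIT`, `READER_UNIT` and the function names are hypothetical, and it assumes the writer flag and the reader count occupy separate bits of the count word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_BIT  1L          /* hypothetical: bit0 marks a writer */
#define READER_UNIT (1L << 6)   /* hypothetical: one reader = +64 */

/* xadd style: one unconditional atomic add. Concurrent readers never
 * make each other retry; a reader only backs out if a writer holds the lock. */
static bool read_trylock_xadd(atomic_long *count)
{
    long old = atomic_fetch_add_explicit(count, READER_UNIT,
                                         memory_order_acquire);
    if (!(old & WRITER_BIT))
        return true;
    atomic_fetch_sub_explicit(count, READER_UNIT, memory_order_relaxed);
    return false;
}

/* cmpxchg style: read, check, then try to swap in the new value. If any
 * other CPU changed the word in between (including another reader), the
 * exchange fails and the loop retries. */
static bool read_trylock_cmpxchg(atomic_long *count)
{
    long old = atomic_load_explicit(count, memory_order_relaxed);

    while (!(old & WRITER_BIT)) {
        if (atomic_compare_exchange_weak_explicit(count, &old,
                                                  old + READER_UNIT,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
        /* 'old' now holds the freshly observed value; loop and re-check. */
    }
    return false;
}

static void read_unlock(atomic_long *count)
{
    long old = atomic_fetch_sub_explicit(count, READER_UNIT,
                                         memory_order_release);
    assert(old >= READER_UNIT);   /* must have held a read lock */
    (void)old;
}
```

With many readers arriving at once, every `read_trylock_xadd()` call completes in a single atomic operation, while `read_trylock_cmpxchg()` callers can repeatedly invalidate each other's expected value. That retry traffic is the reader-contention cost the reply wants to avoid.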
On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> On Thu, 07 Feb 2019, Waiman Long wrote:
>> 30 files changed, 1197 insertions(+), 1594 deletions(-)
>
> Performance numbers on numerous workloads, pretty please.
>
> I'll go and throw this at my mmap_sem intensive workloads
> I've collected.
>
> Thanks,
> Davidlohr

Thanks for getting some of the performance numbers. This is the initial
draft after more than 1 year of hibernation. I will also get other
performance numbers in subsequent revision of the patch.

Cheers,
Longman
On Thu, Feb 7, 2019 at 11:08 AM Waiman Long <longman@redhat.com> wrote:
>
> This patchset revamps the current rwsem-xadd implementation to make
> it saner and easier to work with. This patchset removes all the
> architecture specific assembly code and uses generic C code for all
> architectures. This eases maintenance and enables us to enhance the
> code more easily.
>
> This patchset also implements the following 3 new features:
>
> 1) Waiter lock handoff
> 2) Reader optimistic spinning
> 3) Store write-lock owner in the atomic count (x86-64 only)

The patches are kind of hard to read, with most of them just doing
prep-work that doesn't necessarily matter to the big picture.

What I'd really like to see is

 (a) an overview of the new locking logic

 (b) what's the new fastpath case

 (c) some performance numbers

to explain the changes from a "this is the point of the whole
exercise" standpoint.

And yes, I realize that the lock handoff and optimistic spinning is a
big deal, since I've seen the same regression numbers that presumably
caused this effort to be resurrected. So it's not that I don't find
this intriguing and worthwhile, it's literally that I'd like a summary
not so much of the individual patches, but of the new model.

Please?

               Linus
On 02/08/2019 02:50 PM, Linus Torvalds wrote:
> On Thu, Feb 7, 2019 at 11:08 AM Waiman Long <longman@redhat.com> wrote:
>> This patchset revamps the current rwsem-xadd implementation to make
>> it saner and easier to work with. This patchset removes all the
>> architecture specific assembly code and uses generic C code for all
>> architectures. This eases maintenance and enables us to enhance the
>> code more easily.
>>
>> This patchset also implements the following 3 new features:
>>
>> 1) Waiter lock handoff
>> 2) Reader optimistic spinning
>> 3) Store write-lock owner in the atomic count (x86-64 only)
> The patches are kind of hard to read, with most of them just doing
> prep-work that doesn't necessarily matter to the big picture.
>
> What I'd really like to see is
>
> (a) an overview of the new locking logic

The new locking logic is similar to qrwlock (see patch 11). Cmpxchg is
used to acquire the write lock, while xadd is still used for read lock.
Some of the bits in the count are also reserved for special purpose
like has waiter or lock handoff. Patch 15 tries to compress the
write-lock owner task pointer and put it into the count field for
x86-64 at the expense of fewer bits available for reader count. I have
sent out an additional patch this morning to make sure that the reader
count won't overflow.

In terms of performance, there isn't much change with respect to
read-lock performance. For write-lock, I saw a slight drop in some
cases, but nothing significant. The merging of owner task pointer into
the count field does impose a slightly bigger drop than I would have
liked which I am going to look into a bit more.

>
> (b) what's the new fastpath case

The only change in the fastpath is the use of cmpxchg for writer lock.

>
> (c) some performance numbers

There are performance data at patches 11, 12, 15, 19, 20, 21. There was
performance data for patch 4 as well for eliminating the arch specific
file. Apparently, I might have deleted it accidentally. Anyway, no
noticeable performance difference was observed when switching to use
generic C code for x86, ppc and ARM64.

The major gain in performance is due to reader optimistic spinning
patches. The microbenchmark that I used showed an order of magnitude of
performance improvement for mixed reader-writer workloads. Of course,
we will see less performance gain with real world benchmarks.

I am planning to run more performance test and post the data sometime
next week. Davidlohr is also going to run some of his rwsem performance
test on this patchset.

>
> to explain the changes from a "this is the point of the whole
> exercise" standpoint.
>
> And yes, I realize that the lock handoff and optimistic spinning is a
> big deal, since I've seen the same regression numbers that presumably
> caused this effort to be resurrected. So it's not that I don't find
> this intriguing and worthwhile, it's literally that I'd like a summary
> not so much of the individual patches, but of the new model.
>
> Please?

Maybe I should break this patchset into a few smaller ones to make it
easier to review. Any suggestion is welcome.

Cheers,
Longman
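The bit budget behind storing the write-lock owner in the count on x86-64 can be worked through in a few lines. This is only a sketch of the arithmetic quoted earlier in the thread (52 address bits, 6 alignment bits, 2 flag bits, 16 reader bits). The field order and every name below are assumptions for illustration and are not taken from patch 15; the sketch also treats the owner as a plain address below 2^52, as that arithmetic does.

```c
#include <assert.h>
#include <stdint.h>

enum {
    ADDR_BITS    = 52,                            /* x86-64 physical address width */
    ALIGN_BITS   = 6,                             /* 64-byte alignment: low 6 bits are zero */
    FLAG_BITS    = 2,                             /* status flags */
    OWNER_BITS   = ADDR_BITS - ALIGN_BITS,        /* 46 bits needed for the owner */
    READER_BITS  = 64 - OWNER_BITS - FLAG_BITS,   /* 16 bits left for readers */
    READER_SHIFT = FLAG_BITS,                     /* assumed layout: flags lowest */
    OWNER_SHIFT  = FLAG_BITS + READER_BITS        /* assumed layout: owner highest */
};

#define MAX_READERS ((UINT64_C(1) << READER_BITS) - 1)   /* 64k - 1 */

/* Compress a 64-byte-aligned address below 2^52 into the owner field. */
static uint64_t pack_owner(uint64_t addr)
{
    assert((addr & ((UINT64_C(1) << ALIGN_BITS) - 1)) == 0);
    assert((addr >> ADDR_BITS) == 0);
    return (addr >> ALIGN_BITS) << OWNER_SHIFT;
}

/* Recover the original address from a count word. */
static uint64_t unpack_owner(uint64_t count)
{
    return (count >> OWNER_SHIFT) << ALIGN_BITS;
}

/* Extract the reader count field from a count word. */
static uint64_t reader_count(uint64_t count)
{
    return (count >> READER_SHIFT) & MAX_READERS;
}
```

The reader field tops out at 65535, which is why the thread mentions an additional patch to make sure the reader count cannot overflow into the owner bits.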
On Fri, Feb 8, 2019 at 12:31 PM Waiman Long <longman@redhat.com> wrote:
>
> > (b) what's the new fastpath case
>
> The only change in the fastpath is the use of cmpxchg for writer lock.

.. since a big deal here was about using the generic atomic accessor
functions, I really was looking forward to seeing the *actual* fast
path code generation.

In other words, right now I have very little visibility in how it
actually affects the code. Looking at the patches themselves doesn't
make it obvious.

I was hoping for the overview to really explain the whole "before and
after" situation, and it didn't. Not at the high level, and not at a
low level.

And no performance numbers in the overview either. And yes, I see the
numbers in the patches, but what I really hoped for was some real load
numbers.

In particular, I would have loved to see numbers from the kernel test
robot "will-it-scale.per_thread_ops" case, which is the one that had a
65% regression due to the lack of reader spinning.

So I was kind of hoping to hear whether that regression is basically
entirely gone with this patch series, or if we still have a regression
due to the extra downgrade, or what?

               Linus
* Waiman Long <longman@redhat.com> wrote:

> On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> > On Thu, 07 Feb 2019, Waiman Long wrote:
> >> 30 files changed, 1197 insertions(+), 1594 deletions(-)
> >
> > Performance numbers on numerous workloads, pretty please.
> >
> > I'll go and throw this at my mmap_sem intensive workloads
> > I've collected.
> >
> > Thanks,
> > Davidlohr
>
> Thanks for getting some of the performance numbers. This is the initial
> draft after more than 1 year of hibernation. I will also get other
> performance numbers in subsequent revision of the patch.

If you could sort all the invariant preparatory patches to the head of
the series I can merge them to reduce overall complexity and simplify
performance testing and review of the rest.

Thanks,

	Ingo
Hi all,

Kernel test robot reported a will-it-scale.per_thread_ops -64.1% regression on IVB-desktop for v4.20-rc1.
The first bad commit is: 9bc8039e715da3b53dbac89525323a9f2f69b7b5, Yang Shi <yang.shi@linux.alibaba.com>:
mm: brk: downgrade mmap_sem to read when shrinking (https://lists.01.org/pipermail/lkp/2018-November/009335.html).

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-ivb-d01/brk1/will-it-scale/0x20

commit:
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")

85a06835f6f1ba79 9bc8039e715da3b53dbac89525
---------------- --------------------------
         %stddev   %change          %stddev
             \        |                 \
    196250 ±  8%    -64.1%       70494        will-it-scale.per_thread_ops
    127330 ± 19%    -98.0%        2525 ± 24%  will-it-scale.time.involuntary_context_switches
    727.50 ±  2%    -77.0%      167.25        will-it-scale.time.percent_of_cpu_this_job_got
      2141 ±  2%    -77.6%      479.12        will-it-scale.time.system_time
     50.48 ±  7%    -48.5%       25.98        will-it-scale.time.user_time
  34925294 ± 18%   +270.3%   1.293e+08 ±  4%  will-it-scale.time.voluntary_context_switches
   1570007 ±  8%    -64.1%      563958        will-it-scale.workload
      6435 ±  2%     -6.4%        6024        proc-vmstat.nr_shmem
      1298 ± 16%    -44.5%      721.00 ± 18%  proc-vmstat.pgactivate
      2341          +16.4%        2724        slabinfo.kmalloc-96.active_objs
      2341          +16.4%        2724        slabinfo.kmalloc-96.num_objs
      6346 ±150%    -87.8%      776.25 ±  9%  softirqs.NET_RX
    160107 ±  8%   +151.9%      403273        softirqs.SCHED
   1097999          -13.0%      955526        softirqs.TIMER
      5.50 ±  9%    -81.8%        1.00        vmstat.procs.r
    230700 ± 19%   +269.9%      853292 ±  4%  vmstat.system.cs
     26706 ±  3%    +15.7%       30910 ±  5%  vmstat.system.in
     11.24 ± 23%     +72.2       83.39        mpstat.cpu.idle%
      0.00 ±131%      +0.0        0.04 ± 99%  mpstat.cpu.iowait%
     86.32 ±  2%     -70.8       15.54        mpstat.cpu.sys%
      2.44 ±  7%      -1.4        1.04 ±  8%  mpstat.cpu.usr%
  20610709 ± 15%  +2376.0%   5.103e+08 ± 34%  cpuidle.C1.time
   3233399 ±  8%   +241.5%    11042785 ± 25%  cpuidle.C1.usage
  36172040 ±  6%   +931.3%    3.73e+08 ± 15%  cpuidle.C1E.time
    783605 ±  4%   +548.7%     5083041 ± 18%  cpuidle.C1E.usage
  28753819 ± 39%  +1054.5%   3.319e+08 ± 49%  cpuidle.C3.time
    283912 ± 25%   +688.4%     2238225 ± 34%  cpuidle.C3.usage
 1.507e+08 ± 47%   +292.3%   5.913e+08 ± 28%  cpuidle.C6.time
    339861 ± 37%   +549.7%     2208222 ± 24%  cpuidle.C6.usage
   2709719 ±  5%   +824.2%    25043444        cpuidle.POLL.time
  28602864 ± 18%   +173.7%    78276116 ± 10%  cpuidle.POLL.usage

We found that the patchset could fix the regression.

tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit:
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

85a06835f6f1ba79 fb835fe7f0adbd7c2c074b98ec
---------------- --------------------------
         %stddev    change          %stddev
             \        |                 \
    120736 ± 22%       56%      188019 ±  6%  will-it-scale.time.involuntary_context_switches
      2126 ±  3%        4%        2215        will-it-scale.time.system_time
       722 ±  3%        4%         752        will-it-scale.time.percent_of_cpu_this_job_got
  36256485 ± 27%      -35%    23682989 ±  3%  will-it-scale.time.voluntary_context_switches
      3151 ±  9%       11%        3504        turbostat.Avg_MHz
    229285 ± 32%      -30%      160660 ±  3%  vmstat.system.cs
    120736 ± 22%       56%      188019 ±  6%  time.involuntary_context_switches
      2126 ±  3%        4%        2215        time.system_time
       722 ±  3%        4%         752        time.percent_of_cpu_this_job_got
  36256485 ± 27%      -35%    23682989 ±  3%  time.voluntary_context_switches
        23            643%         171 ±  3%  proc-vmstat.nr_zone_inactive_file
        23            643%         171 ±  3%  proc-vmstat.nr_inactive_file
      3664             12%        4121        proc-vmstat.nr_kernel_stack
      6392              6%        6785        proc-vmstat.nr_slab_unreclaimable
      9991                       10176        proc-vmstat.nr_slab_reclaimable
     63938                       62394        proc-vmstat.nr_zone_active_anon
     63938                       62394        proc-vmstat.nr_active_anon
    386388 ±  9%       -6%      362272        proc-vmstat.pgfree
    368296 ±  9%      -10%      333074        proc-vmstat.numa_hit
    368296 ±  9%      -10%      333074        proc-vmstat.numa_local
      5169 ± 13%      -28%        3745        proc-vmstat.nr_shmem
      1801 ± 21%      -83%         309        proc-vmstat.pgactivate
         0           1e+04       11441        latency_stats.avg.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     13165 ±222%    -1e+04           0        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
     22499 ±151%    -2e+04         657 ±  7%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
    117414 ±181%    -9e+04       24418 ± 44%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
    666005 ±218%    -7e+05         198 ±141%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
   2600097 ±132%    -3e+06         572        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  34391390 ±150%    -3e+07       21807 ±141%  latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  34624774 ±149%    -3e+07       37668 ± 58%  latency_stats.avg.max
         0           1e+04       11441        latency_stats.max.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     22499 ±151%    -2e+04         657 ±  7%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     37845 ±222%    -4e+04           0        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
     80096 ± 59%    -8e+04           0        latency_stats.max.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
    177149 ±195%    -2e+05       24418 ± 44%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
    689417 ±209%    -7e+05         200 ±141%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
  18679699 ±129%    -2e+07         656        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  83587334 ±129%    -8e+07       43457 ±141%  latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  84867236 ±126%    -8e+07       59318 ± 86%  latency_stats.max.max
         0           1e+04       11441        latency_stats.sum.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     22499 ±151%    -2e+04         657 ±  7%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     39431 ±222%    -4e+04           0        latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
    216448 ±200%    -2e+05       24418 ± 44%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
    691960 ±208%    -7e+05         397 ±141%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
  24239011 ±140%    -2e+07        4768 ± 10%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
 1.771e+08 ±122%    -2e+08       43614 ±141%  latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
 1.939e+08 ± 36%    -2e+08           0        latency_stats.sum.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
 2.943e+08 ± 51%    -2e+08    51929782        latency_stats.sum.max
    407463 ± 10%     -100%           0        perf-stat.total.page-faults
  74225651 ± 26%     -100%           0        perf-stat.total.context-switches
     55293 ± 25%     -100%           0        perf-stat.total.cpu-migrations
    407463 ± 10%     -100%           0        perf-stat.total.minor-faults

tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit:
  9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")
  fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

9bc8039e715da3b5 fb835fe7f0adbd7c2c074b98ec
---------------- --------------------------
         %stddev    change          %stddev
             \        |                 \
      3500 ± 36%     5272%      188019 ±  6%  will-it-scale.time.involuntary_context_switches
       483            358%        2215        will-it-scale.time.system_time
       168            346%         752        will-it-scale.time.percent_of_cpu_this_job_got
     71190            180%      199232 ±  4%  will-it-scale.per_thread_ops
    569524            180%     1593862 ±  4%  will-it-scale.workload
     25.85             93%       49.95 ±  3%  will-it-scale.time.user_time
 1.314e+08 ±  3%      -82%    23682989 ±  3%  will-it-scale.time.voluntary_context_switches
     30501 ±  9%      -15%       25813 ±  4%  vmstat.system.in
    799593 ± 10%      -80%      160660 ±  3%  vmstat.system.cs
       887 ± 11%      295%        3504        turbostat.Avg_MHz
     23.60 ± 10%       68%       39.54        turbostat.CorWatt
     28.38 ±  8%       57%       44.43        turbostat.PkgWatt
      3500 ± 36%     5272%      188019 ±  6%  time.involuntary_context_switches
       483            358%        2215        time.system_time
       168            346%         752        time.percent_of_cpu_this_job_got
     25.85             93%       49.95 ±  3%  time.user_time
 1.314e+08 ±  3%      -82%    23682989 ±  3%  time.voluntary_context_switches
         0 ± 44%    46220%         386        proc-vmstat.nr_zone_active_file
         0 ± 44%    46220%         386        proc-vmstat.nr_active_file
        23            643%         171 ±  3%  proc-vmstat.nr_zone_inactive_file
        23            643%         171 ±  3%  proc-vmstat.nr_inactive_file
      3690             12%        4121        proc-vmstat.nr_kernel_stack
      6419              6%        6785        proc-vmstat.nr_slab_unreclaimable
      9961                       10176        proc-vmstat.nr_slab_reclaimable
    229251                      231278        proc-vmstat.nr_zone_unevictable
    229251                      231278        proc-vmstat.nr_unevictable
      1008                        1005        proc-vmstat.nr_page_table_pages
     63178                       62394        proc-vmstat.nr_zone_active_anon
     63178                       62394        proc-vmstat.nr_active_anon
    432061 ± 12%      -11%      385372        proc-vmstat.pgfault
    408099 ± 10%      -11%      362272        proc-vmstat.pgfree
    422206 ±  9%      -11%      373690        proc-vmstat.pgalloc_normal
    382357 ± 11%      -13%      333074        proc-vmstat.numa_hit
    382357 ± 11%      -13%      333074        proc-vmstat.numa_local
      4428 ± 17%      -15%        3745        proc-vmstat.nr_shmem
         0           1e+04       11441        latency_stats.avg.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11180 ±168%    -1e+04         657 ±  7%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     19239 ±223%    -2e+04           0        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
     63702 ±169%    -4e+04       24418 ± 44%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
     77617 ±205%    -8e+04         510 ± 11%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
   3043762 ±124%    -3e+06         572        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  11630441 ±139%    -1e+07       21807 ±141%  latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  12242832 ±129%    -1e+07       37668 ± 58%  latency_stats.avg.max
         0           1e+04       11441        latency_stats.max.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11180 ±168%    -1e+04         657 ±  7%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     19239 ±223%    -2e+04           0        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
     29152 ± 11%    -3e+04           0        latency_stats.max.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
     65909 ±164%    -4e+04       24418 ± 44%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
     77617 ±205%    -8e+04         510 ± 11%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
  17301268 ±125%    -2e+07         656        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  44248611 ±140%    -4e+07       43457 ±141%  latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  46380610 ±130%    -5e+07       59318 ± 86%  latency_stats.max.max
         0           1e+04       11441        latency_stats.sum.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11180 ±168%    -1e+04         657 ±  7%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     19239 ±223%    -2e+04           0        latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
     74047 ±148%    -5e+04       24418 ± 44%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
     77617 ±205%    -8e+04         510 ± 11%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
  26043088 ±130%    -3e+07        4768 ± 10%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  82480038 ±152%    -8e+07       43614 ±141%  latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
 1.771e+09          -2e+09    51929782        latency_stats.sum.max
 1.771e+09          -2e+09           0        latency_stats.sum.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
    420016 ± 12%     -100%           0        perf-stat.total.page-faults
 2.648e+08 ±  3%     -100%           0        perf-stat.total.context-switches
     52212 ± 18%     -100%           0        perf-stat.total.cpu-migrations
    420016 ± 12%     -100%           0        perf-stat.total.minor-faults

Best Regards,
Rong Chen
Ok, those test robot reports are hard to read, but trying to distill it down:

On Wed, Feb 13, 2019 at 1:19 AM Chen Rong <rong.a.chen@intel.com> wrote:
>
>          %stddev   %change          %stddev
>              \        |                 \
>     196250 ±  8%    -64.1%       70494        will-it-scale.per_thread_ops

That's the original 64% regression..

And then with the patch set:

>          %stddev    change          %stddev
>              \        |                 \
>      71190            180%      199232 ±  4%  will-it-scale.per_thread_ops

looks like it's back up where it used to be.

So I guess we have numbers for the regression now. Thanks.

And that closes my biggest question for the new model, and with the
new organization that gets rid of the arch-specific asm separately
first and makes it a bit more legible that way, I guess I'll just Ack
the whole series.

               Linus
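The percentages in the two distilled rows follow directly from the per_thread_ops values. The small helper below (a hypothetical `pct_change()`, written only to check the arithmetic) reproduces them from the numbers quoted in the robot report.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
static double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Values taken from the will-it-scale.per_thread_ops rows above. */
static double regression_pct(void)  { return pct_change(196250.0, 70494.0);  }  /* about -64.1 */
static double recovery_pct(void)    { return pct_change(71190.0, 199232.0);  }  /* about +180  */
static double vs_baseline_pct(void) { return pct_change(196250.0, 199232.0); }  /* about +1.5  */
```

The last figure compares the patched kernel with the original pre-regression baseline: 199232 against 196250 is a difference of roughly 1.5%, well inside the reported ±4% to ±8% run-to-run variation, which matches the reading that throughput is back where it used to be.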
On Fri, 08 Feb 2019, Waiman Long wrote:
>I am planning to run more performance test and post the data sometime
>next week. Davidlohr is also going to run some of his rwsem performance
>test on this patchset.

So I ran this series on a 40-core IB 2 socket with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
   metric is faults/cpu and faults/sec

                              v5.0-rc6                  v5.0-rc6
                                                           dirty
Hmean   faults/cpu-1       624224.9815 (   0.00%)    618847.5201 *  -0.86%*
Hmean   faults/cpu-4       539550.3509 (   0.00%)    547407.5738 *   1.46%*
Hmean   faults/cpu-7       401470.3461 (   0.00%)    381157.9830 *  -5.06%*
Hmean   faults/cpu-12      267617.0353 (   0.00%)    271098.5441 *   1.30%*
Hmean   faults/cpu-21      176194.4641 (   0.00%)    175151.3256 *  -0.59%*
Hmean   faults/cpu-30      119927.3862 (   0.00%)    120610.1348 *   0.57%*
Hmean   faults/cpu-40       91203.6820 (   0.00%)     91832.7489 *   0.69%*
Hmean   faults/sec-1       623292.3467 (   0.00%)    617992.0795 *  -0.85%*
Hmean   faults/sec-4      2113364.6898 (   0.00%)   2140254.8238 *   1.27%*
Hmean   faults/sec-7      2557378.4385 (   0.00%)   2450945.7060 *  -4.16%*
Hmean   faults/sec-12     2696509.8975 (   0.00%)   2747968.9819 *   1.91%*
Hmean   faults/sec-21     2902892.5639 (   0.00%)   2905923.3881 *   0.10%*
Hmean   faults/sec-30     2956696.5793 (   0.00%)   2990583.5147 *   1.15%*
Hmean   faults/sec-40     3422806.4806 (   0.00%)   3352970.3082 *  -2.04%*
Stddev  faults/cpu-1         2949.5159 (   0.00%)      2802.2712 (   4.99%)
Stddev  faults/cpu-4        24165.9454 (   0.00%)     15841.1232 (  34.45%)
Stddev  faults/cpu-7        20914.8351 (   0.00%)     22744.3294 (  -8.75%)
Stddev  faults/cpu-12       11274.3490 (   0.00%)     14733.3152 ( -30.68%)
Stddev  faults/cpu-21        2500.1950 (   0.00%)      2200.9518 (  11.97%)
Stddev  faults/cpu-30        1599.5346 (   0.00%)      1414.0339 (  11.60%)
Stddev  faults/cpu-40        1473.0181 (   0.00%)      3004.1209 (-103.94%)
Stddev  faults/sec-1         2655.2581 (   0.00%)      2405.1625 (   9.42%)
Stddev  faults/sec-4        84042.7234 (   0.00%)     57996.7158 (  30.99%)
Stddev  faults/sec-7       123656.7901 (   0.00%)    135591.1087 (  -9.65%)
Stddev  faults/sec-12       97135.6091 (   0.00%)    127054.4926 ( -30.80%)
Stddev  faults/sec-21       69564.6264 (   0.00%)     65922.6381 (   5.24%)
Stddev  faults/sec-30       51524.4027 (   0.00%)     56109.4159 (  -8.90%)
Stddev  faults/sec-40      101927.5280 (   0.00%)    160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.

-- git checkout

First metric is total runtime for runs with incremental threads.

              v5.0-rc6    v5.0-rc6
                             dirty
User            218.95      219.07
System          149.29      146.82
Elapsed        1574.10     1427.08

In this case there's a non trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
   i_mmap)

                              v5.0-rc6                  v5.0-rc6
                                                           dirty
Hmean   compute-1             6674.01 (   0.00%)       6544.28 *  -1.94%*
Hmean   compute-21           85294.91 (   0.00%)      85524.20 *   0.27%*
Hmean   compute-41          149674.70 (   0.00%)     149494.58 *  -0.12%*
Hmean   compute-61          177721.15 (   0.00%)     170507.38 *  -4.06%*
Hmean   compute-81          181531.07 (   0.00%)     180463.24 *  -0.59%*
Hmean   compute-101         189024.09 (   0.00%)     187288.86 *  -0.92%*
Hmean   compute-121         200673.24 (   0.00%)     195327.65 *  -2.66%*
Hmean   compute-141         213082.29 (   0.00%)     211290.80 *  -0.84%*
Hmean   compute-161         207764.06 (   0.00%)     204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.

Hmean   new_dbase-1             60.48 (   0.00%)         60.63 *   0.25%*
Hmean   new_dbase-21          6590.49 (   0.00%)       6671.81 *   1.23%*
Hmean   new_dbase-41         14202.91 (   0.00%)      14470.59 *   1.88%*
Hmean   new_dbase-61         21207.24 (   0.00%)      21067.40 *  -0.66%*
Hmean   new_dbase-81         25542.40 (   0.00%)      25542.40 *   0.00%*
Hmean   new_dbase-101        30165.28 (   0.00%)      30046.21 *  -0.39%*
Hmean   new_dbase-121        33638.33 (   0.00%)      33219.90 *  -1.24%*
Hmean   new_dbase-141        36723.70 (   0.00%)      37504.52 *   2.13%*
Hmean   new_dbase-161        42242.51 (   0.00%)      42117.34 *  -0.30%*
Hmean   shared-1                76.54 (   0.00%)         76.09 *  -0.59%*
Hmean   shared-21             7535.51 (   0.00%)       5518.75 * -26.76%*
Hmean   shared-41            17207.81 (   0.00%)      14651.94 * -14.85%*
Hmean   shared-61            20716.98 (   0.00%)      18667.52 *  -9.89%*
Hmean   shared-81            27603.83 (   0.00%)      23466.45 * -14.99%*
Hmean   shared-101           26008.59 (   0.00%)      29536.96 *  13.57%*
Hmean   shared-121           28354.76 (   0.00%)      43139.39 *  52.14%*
Hmean   shared-141           38509.25 (   0.00%)      41619.35 *   8.08%*
Hmean   shared-161           40496.07 (   0.00%)      44303.46 *   9.40%*

Overall there is a small hit (in the noise level but consistent throughout
many workloads), except git-checkout which does quite well.

Thanks,
Davidlohr
On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance test and post the data sometimes
>> next week. Davidlohr is also going to run some of his rwsem performance
>> test on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various worklods in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>                          v5.0-rc6                v5.0-rc6
>                                                     dirty
> Hmean  faults/cpu-1   624224.9815 (   0.00%)  618847.5201 *  -0.86%*
> Hmean  faults/cpu-4   539550.3509 (   0.00%)  547407.5738 *   1.46%*
> Hmean  faults/cpu-7   401470.3461 (   0.00%)  381157.9830 *  -5.06%*
> Hmean  faults/cpu-12  267617.0353 (   0.00%)  271098.5441 *   1.30%*
> Hmean  faults/cpu-21  176194.4641 (   0.00%)  175151.3256 *  -0.59%*
> Hmean  faults/cpu-30  119927.3862 (   0.00%)  120610.1348 *   0.57%*
> Hmean  faults/cpu-40   91203.6820 (   0.00%)   91832.7489 *   0.69%*
> Hmean  faults/sec-1   623292.3467 (   0.00%)  617992.0795 *  -0.85%*
> Hmean  faults/sec-4  2113364.6898 (   0.00%) 2140254.8238 *   1.27%*
> Hmean  faults/sec-7  2557378.4385 (   0.00%) 2450945.7060 *  -4.16%*
> Hmean  faults/sec-12 2696509.8975 (   0.00%) 2747968.9819 *   1.91%*
> Hmean  faults/sec-21 2902892.5639 (   0.00%) 2905923.3881 *   0.10%*
> Hmean  faults/sec-30 2956696.5793 (   0.00%) 2990583.5147 *   1.15%*
> Hmean  faults/sec-40 3422806.4806 (   0.00%) 3352970.3082 *  -2.04%*
> Stddev faults/cpu-1     2949.5159 (   0.00%)    2802.2712 (   4.99%)
> Stddev faults/cpu-4    24165.9454 (   0.00%)   15841.1232 (  34.45%)
> Stddev faults/cpu-7    20914.8351 (   0.00%)   22744.3294 (  -8.75%)
> Stddev faults/cpu-12   11274.3490 (   0.00%)   14733.3152 ( -30.68%)
> Stddev faults/cpu-21    2500.1950 (   0.00%)    2200.9518 (  11.97%)
> Stddev faults/cpu-30    1599.5346 (   0.00%)    1414.0339 (  11.60%)
> Stddev faults/cpu-40    1473.0181 (   0.00%)    3004.1209 (-103.94%)
> Stddev faults/sec-1     2655.2581 (   0.00%)    2405.1625 (   9.42%)
> Stddev faults/sec-4    84042.7234 (   0.00%)   57996.7158 (  30.99%)
> Stddev faults/sec-7   123656.7901 (   0.00%)  135591.1087 (  -9.65%)
> Stddev faults/sec-12   97135.6091 (   0.00%)  127054.4926 ( -30.80%)
> Stddev faults/sec-21   69564.6264 (   0.00%)   65922.6381 (   5.24%)
> Stddev faults/sec-30   51524.4027 (   0.00%)   56109.4159 (  -8.90%)
> Stddev faults/sec-40  101927.5280 (   0.00%)  160117.0093 ( -57.09%)
>
> With the exception of the hicup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>               v5.0-rc6    v5.0-rc6
>                              dirty
> User            218.95      219.07
> System          149.29      146.82
> Elapsed        1574.10     1427.08
>
> In this case there's a non trivial improvement (11%) in overall
> elapsed time.
>
> -- reaim (which is always succeptible to rwsem changes for both
> mmap_sem and
> i_mmap)
>                       v5.0-rc6             v5.0-rc6
>                                               dirty
> Hmean  compute-1       6674.01 (   0.00%)   6544.28 *  -1.94%*
> Hmean  compute-21     85294.91 (   0.00%)  85524.20 *   0.27%*
> Hmean  compute-41    149674.70 (   0.00%) 149494.58 *  -0.12%*
> Hmean  compute-61    177721.15 (   0.00%) 170507.38 *  -4.06%*
> Hmean  compute-81    181531.07 (   0.00%) 180463.24 *  -0.59%*
> Hmean  compute-101   189024.09 (   0.00%) 187288.86 *  -0.92%*
> Hmean  compute-121   200673.24 (   0.00%) 195327.65 *  -2.66%*
> Hmean  compute-141   213082.29 (   0.00%) 211290.80 *  -0.84%*
> Hmean  compute-161   207764.06 (   0.00%) 204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean  new_dbase-1       60.48 (   0.00%)     60.63 *   0.25%*
> Hmean  new_dbase-21    6590.49 (   0.00%)   6671.81 *   1.23%*
> Hmean  new_dbase-41   14202.91 (   0.00%)  14470.59 *   1.88%*
> Hmean  new_dbase-61   21207.24 (   0.00%)  21067.40 *  -0.66%*
> Hmean  new_dbase-81   25542.40 (   0.00%)  25542.40 *   0.00%*
> Hmean  new_dbase-101  30165.28 (   0.00%)  30046.21 *  -0.39%*
> Hmean  new_dbase-121  33638.33 (   0.00%)  33219.90 *  -1.24%*
> Hmean  new_dbase-141  36723.70 (   0.00%)  37504.52 *   2.13%*
> Hmean  new_dbase-161  42242.51 (   0.00%)  42117.34 *  -0.30%*
> Hmean  shared-1          76.54 (   0.00%)     76.09 *  -0.59%*
> Hmean  shared-21       7535.51 (   0.00%)   5518.75 * -26.76%*
> Hmean  shared-41      17207.81 (   0.00%)  14651.94 * -14.85%*
> Hmean  shared-61      20716.98 (   0.00%)  18667.52 *  -9.89%*
> Hmean  shared-81      27603.83 (   0.00%)  23466.45 * -14.99%*
> Hmean  shared-101     26008.59 (   0.00%)  29536.96 *  13.57%*
> Hmean  shared-121     28354.76 (   0.00%)  43139.39 *  52.14%*
> Hmean  shared-141     38509.25 (   0.00%)  41619.35 *   8.08%*
> Hmean  shared-161     40496.07 (   0.00%)  44303.46 *   9.40%*
>
> Overall there is a small hit (in the noise level but consistent
> throughout
> many workloads), except git-checkout which does quite well.
>
> Thanks,
> Davidlohr

Thanks for running the patch through your performance tests.

Cheers,
Longman
Hi, Waiman,

What's the status of this patchset? And its merging plan?

Best Regards,
Huang, Ying
On 04/10/2019 04:15 AM, huang ying wrote:
> Hi, Waiman,
>
> What's the status of this patchset? And its merging plan?
>
> Best Regards,
> Huang, Ying

I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Cheers,
Longman
On Thu, Apr 11, 2019 at 12:08 AM Waiman Long <longman@redhat.com> wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset? And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks! Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>