Message ID | 20180604095520.8563-8-xiaoguangrong@tencent.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>
> Try to hold src_page_req_mutex only if the queue is not
> empty

Pure question: how much would this patch help?  Basically, if you are
running compression tests then I think it means you are using precopy
(since postcopy cannot work with compression yet), so the lock here
has no contention at all.

Regards,
On 06/19/2018 03:36 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Try to hold src_page_req_mutex only if the queue is not
>> empty
>
> Pure question: how much would this patch help?  Basically, if you are
> running compression tests then I think it means you are using precopy
> (since postcopy cannot work with compression yet), so the lock here
> has no contention at all.

Yes, you are right; however, we can observe it among the top functions
(after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+ 7.99% kqemu qemu-system-x86_64 [.] ram_bytes_total
+ 6.95% kqemu [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 6.23% kqemu qemu-system-x86_64 [.] qemu_put_qemu_file
+ 6.20% kqemu qemu-system-x86_64 [.] qemu_event_set
+ 5.80% kqemu qemu-system-x86_64 [.] __ring_put
+ 4.82% kqemu qemu-system-x86_64 [.] compress_thread_data_done
+ 4.11% kqemu qemu-system-x86_64 [.] ring_is_full
+ 3.07% kqemu qemu-system-x86_64 [.] threads_submit_request_prepare
+ 2.83% kqemu qemu-system-x86_64 [.] ring_mp_get
+ 2.71% kqemu qemu-system-x86_64 [.] __ring_is_full
+ 2.46% kqemu qemu-system-x86_64 [.] buffer_zero_sse2
+ 2.40% kqemu qemu-system-x86_64 [.] add_to_iovec
+ 2.21% kqemu qemu-system-x86_64 [.] ring_get
+ 1.96% kqemu [kernel.kallsyms] [k] __lock_acquire
+ 1.90% kqemu libc-2.12.so [.] memcpy
+ 1.55% kqemu qemu-system-x86_64 [.] ring_len
+ 1.12% kqemu libpthread-2.12.so [.] pthread_mutex_unlock
+ 1.11% kqemu qemu-system-x86_64 [.] ram_find_and_save_block
+ 1.07% kqemu qemu-system-x86_64 [.] ram_save_host_page
+ 1.04% kqemu qemu-system-x86_64 [.] qemu_put_buffer
+ 0.97% kqemu qemu-system-x86_64 [.] compress_page_with_multi_thread
+ 0.96% kqemu qemu-system-x86_64 [.] ram_save_target_page
+ 0.93% kqemu libpthread-2.12.so [.] pthread_mutex_lock

I guess the mutex's atomic operations cost CPU resources, and
check-before-lock is a common technique; I think it shouldn't have any
side effects, right? :)
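For readers who want the check-before-lock idea in isolation, here is a
minimal C11/pthreads sketch. All names in it (struct request,
request_queue, queue_lock, queue_request, fetch_request) are invented for
illustration and are not QEMU code; the actual patch at the end of this
thread expresses the same idea with QSIMPLEQ_EMPTY_ATOMIC() on
rs->src_page_requests.

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct request {
    struct request *next;
};

/* Head of a singly linked request list (LIFO order, kept minimal for the
 * sketch), normally protected by queue_lock. */
static struct request *_Atomic request_queue;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer side: always takes the lock. */
static void queue_request(struct request *req)
{
    pthread_mutex_lock(&queue_lock);
    req->next = atomic_load(&request_queue);
    atomic_store(&request_queue, req);
    pthread_mutex_unlock(&queue_lock);
}

/* Consumer side: peek without the lock, dequeue under the lock. */
static struct request *fetch_request(void)
{
    struct request *req;

    /*
     * Cheap lockless peek: if the list looks empty, skip the mutex
     * entirely.  A request enqueued right after this check is simply
     * picked up on the next call, so nothing is lost.
     */
    if (atomic_load(&request_queue) == NULL) {
        return NULL;
    }

    pthread_mutex_lock(&queue_lock);
    req = atomic_load(&request_queue);
    if (req != NULL) {                 /* re-check under the lock */
        atomic_store(&request_queue, req->next);
    }
    pthread_mutex_unlock(&queue_lock);

    return req;
}

Because the producer always holds the lock while enqueueing, the
consumer's unlocked peek can at worst miss a request that is being added
concurrently, and that request is seen on the next call.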
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> On 06/19/2018 03:36 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > >
> > > Try to hold src_page_req_mutex only if the queue is not
> > > empty
> >
> > Pure question: how much would this patch help?  Basically, if you are
> > running compression tests then I think it means you are using precopy
> > (since postcopy cannot work with compression yet), so the lock here
> > has no contention at all.
>
> Yes, you are right; however, we can observe it among the top functions
> (after reverting this patch):

Can you show the matching trace with the patch in?

Dave

> [... perf trace quoted above snipped ...]
>
> I guess the mutex's atomic operations cost CPU resources, and
> check-before-lock is a common technique; I think it shouldn't have any
> side effects, right? :)
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 06/29/2018 07:22 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Try to hold src_page_req_mutex only if the queue is not
>>>> empty
>>>
>>> Pure question: how much would this patch help?  Basically, if you are
>>> running compression tests then I think it means you are using precopy
>>> (since postcopy cannot work with compression yet), so the lock here
>>> has no contention at all.
>>
>> Yes, you are right; however, we can observe it among the top functions
>> (after reverting this patch):
>
> Can you show the matching trace with the patch in?

Sure, here it is:

+ 8.38% kqemu [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 8.03% kqemu qemu-system-x86_64 [.] ram_bytes_total
+ 6.62% kqemu qemu-system-x86_64 [.] qemu_event_set
+ 6.02% kqemu qemu-system-x86_64 [.] qemu_put_qemu_file
+ 5.81% kqemu qemu-system-x86_64 [.] __ring_put
+ 5.04% kqemu qemu-system-x86_64 [.] compress_thread_data_done
+ 4.48% kqemu qemu-system-x86_64 [.] ring_is_full
+ 4.44% kqemu qemu-system-x86_64 [.] ring_mp_get
+ 3.39% kqemu qemu-system-x86_64 [.] __ring_is_full
+ 2.61% kqemu qemu-system-x86_64 [.] add_to_iovec
+ 2.48% kqemu qemu-system-x86_64 [.] threads_submit_request_prepare
+ 2.08% kqemu libc-2.12.so [.] memcpy
+ 2.07% kqemu qemu-system-x86_64 [.] ring_len
+ 1.91% kqemu [kernel.kallsyms] [k] __lock_acquire
+ 1.60% kqemu qemu-system-x86_64 [.] buffer_zero_sse2
+ 1.16% kqemu qemu-system-x86_64 [.] ram_find_and_save_block
+ 1.14% kqemu qemu-system-x86_64 [.] ram_save_target_page
+ 1.12% kqemu qemu-system-x86_64 [.] compress_page_with_multi_thread
+ 1.09% kqemu qemu-system-x86_64 [.] ram_save_host_page
+ 1.07% kqemu qemu-system-x86_64 [.] test_and_clear_bit
+ 1.07% kqemu qemu-system-x86_64 [.] qemu_put_buffer
+ 1.03% kqemu qemu-system-x86_64 [.] qemu_put_byte
+ 0.80% kqemu qemu-system-x86_64 [.] threads_submit_request_commit
+ 0.74% kqemu qemu-system-x86_64 [.] migration_bitmap_clear_dirty
+ 0.70% kqemu qemu-system-x86_64 [.] control_save_page
+ 0.69% kqemu qemu-system-x86_64 [.] test_bit
+ 0.69% kqemu qemu-system-x86_64 [.] ram_save_iterate
+ 0.63% kqemu qemu-system-x86_64 [.] migration_bitmap_find_dirty
+ 0.63% kqemu qemu-system-x86_64 [.] ram_control_save_page
+ 0.62% kqemu qemu-system-x86_64 [.] rcu_read_lock
+ 0.56% kqemu qemu-system-x86_64 [.] qemu_file_get_error
+ 0.55% kqemu [kernel.kallsyms] [k] lock_acquire
+ 0.55% kqemu qemu-system-x86_64 [.] find_dirty_block
+ 0.54% kqemu qemu-system-x86_64 [.] ring_index
+ 0.53% kqemu qemu-system-x86_64 [.] ring_put
+ 0.51% kqemu qemu-system-x86_64 [.] unqueue_page
+ 0.50% kqemu qemu-system-x86_64 [.] migrate_use_compression
+ 0.48% kqemu qemu-system-x86_64 [.] get_queued_page
+ 0.46% kqemu qemu-system-x86_64 [.] ring_get
+ 0.46% kqemu [i40e] [k] i40e_clean_tx_irq
+ 0.45% kqemu [kernel.kallsyms] [k] lock_release
+ 0.44% kqemu [kernel.kallsyms] [k] native_sched_clock
+ 0.38% kqemu qemu-system-x86_64 [.] migrate_get_current
+ 0.38% kqemu [kernel.kallsyms] [k] find_held_lock
+ 0.34% kqemu [kernel.kallsyms] [k] __lock_release
+ 0.34% kqemu qemu-system-x86_64 [.] qemu_ram_pagesize
+ 0.29% kqemu [kernel.kallsyms] [k] lock_is_held_type
+ 0.27% kqemu [kernel.kallsyms] [k] update_load_avg
+ 0.27% kqemu qemu-system-x86_64 [.] save_page_use_compression
+ 0.24% kqemu qemu-system-x86_64 [.] qemu_file_rate_limit
+ 0.23% kqemu [kernel.kallsyms] [k] tcp_sendmsg
+ 0.23% kqemu [kernel.kallsyms] [k] match_held_lock
+ 0.22% kqemu [kernel.kallsyms] [k] do_raw_spin_trylock
+ 0.22% kqemu [kernel.kallsyms] [k] cyc2ns_read_begin
On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
>
> On 06/19/2018 03:36 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > >
> > > Try to hold src_page_req_mutex only if the queue is not
> > > empty
> >
> > Pure question: how much would this patch help?  Basically, if you are
> > running compression tests then I think it means you are using precopy
> > (since postcopy cannot work with compression yet), so the lock here
> > has no contention at all.
>
> Yes, you are right; however, we can observe it among the top functions
> (after reverting this patch):
>
> [... perf trace quoted above snipped ...]

(Sorry to respond late; I was busy with other stuff for the release...)

I am trying to find out anything related to unqueue_page() but I
failed.  Did I miss anything obvious there?

> I guess the mutex's atomic operations cost CPU resources, and
> check-before-lock is a common technique; I think it shouldn't have any
> side effects, right? :)

Yeah, it makes sense to me. :)

Thanks,
On 07/11/2018 04:21 PM, Peter Xu wrote:
> On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
>>
>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Try to hold src_page_req_mutex only if the queue is not
>>>> empty
>>>
>>> Pure question: how much would this patch help?  Basically, if you are
>>> running compression tests then I think it means you are using precopy
>>> (since postcopy cannot work with compression yet), so the lock here
>>> has no contention at all.
>>
>> Yes, you are right; however, we can observe it among the top functions
>> (after reverting this patch):
>>
>> [... perf trace quoted above snipped ...]
>
> (Sorry to respond late; I was busy with other stuff for the release...)

You're welcome. :)

> I am trying to find out anything related to unqueue_page() but I
> failed.  Did I miss anything obvious there?

unqueue_page() was indeed not listed here; I think the function itself
is light enough (a check, then an immediate return), so it did not
leave a trace here.

This perf data was collected after reverting this patch, i.e., it is
based on the lockless multithread model, so unqueue_page() is the only
place the main thread uses a mutex.

And you can see the overhead of the mutex is gone after applying this
patch, in the trace I sent in my reply to Dave.
On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
> [...]
> > I am trying to find out anything related to unqueue_page() but I
> > failed.  Did I miss anything obvious there?
>
> unqueue_page() was indeed not listed here; I think the function itself
> is light enough (a check, then an immediate return), so it did not
> leave a trace here.
>
> This perf data was collected after reverting this patch, i.e., it is
> based on the lockless multithread model, so unqueue_page() is the only
> place the main thread uses a mutex.
>
> And you can see the overhead of the mutex is gone after applying this
> patch, in the trace I sent in my reply to Dave.

I see.  It's not a big portion of CPU resources, though of course I
have no reason to object to this change either.

Actually, what interests me more is why ram_bytes_total() is such a
hot spot.  AFAIU it is only called once per ram_find_and_save_block()
call, and it should be mostly constant if we don't plug/unplug memory.
Not sure whether that means it's a better spot to work on.

Thanks,
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> [...]
> unqueue_page() was indeed not listed here; I think the function itself
> is light enough (a check, then an immediate return), so it did not
> leave a trace here.
>
> This perf data was collected after reverting this patch, i.e., it is
> based on the lockless multithread model, so unqueue_page() is the only
> place the main thread uses a mutex.
>
> And you can see the overhead of the mutex is gone after applying this
> patch, in the trace I sent in my reply to Dave.

I got around to playing with this patch and using 'perf top' to see
what was going on.  What I noticed was that without this patch
pthread_mutex_unlock and qemu_mutex_lock_impl were both noticeable;
with the patch they'd pretty much vanished.  So I think it's worth it.

I couldn't honestly see a difference in total CPU usage or bandwidth,
but the migration code is so spiky in usage that it's difficult to
measure anyway.

So,

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 07/12/2018 04:26 PM, Peter Xu wrote:
> [...]
> I see.  It's not a big portion of CPU resources, though of course I
> have no reason to object to this change either.
>
> Actually, what interests me more is why ram_bytes_total() is such a
> hot spot.  AFAIU it is only called once per ram_find_and_save_block()
> call, and it should be mostly constant if we don't plug/unplug memory.
> Not sure whether that means it's a better spot to work on.

I noticed it too.  That could be another item we will work on. :)
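As a sketch of what such follow-up work might look like (the structures
and names below are invented and do not match QEMU's actual RAMBlock
list), the total could be cached and recomputed only when the block list
changes:

#include <stdint.h>

/* Hypothetical sketch -- none of these names exist in QEMU. */
struct ram_block {
    uint64_t used_length;
    struct ram_block *next;
};

static struct ram_block *block_list;     /* all RAM blocks */
static unsigned int block_list_gen;      /* bumped whenever blocks are added/removed */

static uint64_t cached_total;
static unsigned int cached_gen = (unsigned int)-1;

/* Walk the list only when the block list has changed since the last call. */
static uint64_t ram_bytes_total_cached(void)
{
    if (cached_gen != block_list_gen) {
        uint64_t total = 0;

        for (struct ram_block *b = block_list; b; b = b->next) {
            total += b->used_length;
        }
        cached_total = total;
        cached_gen = block_list_gen;
    }
    return cached_total;
}

Whether this is worthwhile depends on how the block list is actually
traversed and synchronized in QEMU, which the sketch deliberately
ignores.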
On Wed, Jul 18, 2018 at 04:56:13PM +0800, Xiao Guangrong wrote:
>
> On 07/12/2018 04:26 PM, Peter Xu wrote:
> > [...]
> > Actually, what interests me more is why ram_bytes_total() is such a
> > hot spot.  AFAIU it is only called once per ram_find_and_save_block()
> > call, and it should be mostly constant if we don't plug/unplug memory.
> > Not sure whether that means it's a better spot to work on.
>
> I noticed it too.  That could be another item we will work on. :)

Yeah, I'm looking forward to that. :)

Btw, please feel free to post the common code separately in your next
post if the series grows even bigger (it may help even for
no-compression migrations); I would bet it'll have a better chance of
being reviewed and merged faster (though possibly after QEMU 3.0).

Regards,
diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 59fd1203a1..ac418efc43 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -341,6 +341,7 @@ struct {                                                \
 /*
  * Simple queue access methods.
  */
+#define QSIMPLEQ_EMPTY_ATOMIC(head) (atomic_read(&((head)->sqh_first)) == NULL)
 #define QSIMPLEQ_EMPTY(head)        ((head)->sqh_first == NULL)
 #define QSIMPLEQ_FIRST(head)        ((head)->sqh_first)
 #define QSIMPLEQ_NEXT(elm, field)   ((elm)->field.sqe_next)
diff --git a/migration/ram.c b/migration/ram.c
index 15b20d3f70..f9a8646520 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1550,6 +1550,10 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
 {
     RAMBlock *block = NULL;
 
+    if (QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests)) {
+        return NULL;
+    }
+
     qemu_mutex_lock(&rs->src_page_req_mutex);
     if (!QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
         struct RAMSrcPageRequest *entry =
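A note on why the unlocked check above is presumably safe: the enqueue
side still takes src_page_req_mutex, and unqueue_page() is called
repeatedly from the migration loop, so a request added just after the
lockless peek is merely picked up on a later call rather than lost. In
the precopy-with-compression case discussed in this thread the queue is
empty almost all of the time, which is exactly when the early return
pays off.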