[06/12] migration: do not detect zero page for compression

Message ID	20180604095520.8563-7-xiaoguangrong@tencent.com (mailing list archive)
State	New, archived
Headers
  Return-Path: <kvm-owner@kernel.org>
  From: guangrong.xiao@gmail.com
  To: pbonzini@redhat.com, mst@redhat.com, mtosatti@redhat.com
  Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, dgilbert@redhat.com, peterx@redhat.com, jiang.biao2@zte.com.cn, wei.w.wang@intel.com, Xiao Guangrong <xiaoguangrong@tencent.com>
  Subject: [PATCH 06/12] migration: do not detect zero page for compression
  Date: Mon, 4 Jun 2018 17:55:14 +0800
  Message-Id: <20180604095520.8563-7-xiaoguangrong@tencent.com>
  In-Reply-To: <20180604095520.8563-1-xiaoguangrong@tencent.com>
  References: <20180604095520.8563-1-xiaoguangrong@tencent.com>
  Sender: kvm-owner@vger.kernel.org
  Precedence: bulk

Xiao Guangrong June 4, 2018, 9:55 a.m. UTC

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Detecting zero page is not a light work, we can disable it
for compression that can handle all zero data very well

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

Peter Xu June 19, 2018, 7:30 a.m. UTC | #1

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Detecting zero page is not a light work, we can disable it
> for compression that can handle all zero data very well

Are there any numbers showing how the compression algo performs better
than the zero-detect algo?  I ask since AFAIU buffer_is_zero() might
be fast, depending on how init_accel() is done in util/bufferiszero.c.

From compression rate POV of course zero page algo wins since it
contains no data (but only a flag).

Regards,

Xiao Guangrong June 28, 2018, 9:12 a.m. UTC | #2

Hi Peter,

Sorry for the delay, I was busy with other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Detecting zero page is not a light work, we can disable it
>> for compression that can handle all zero data very well
> 
> Is there any number shows how the compression algo performs better
> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> be fast, depending on how init_accel() is done in util/bufferiszero.c.

Here is a comparison between zero-detection and compression (the target
buffer is all zero bits):

Zero 810 ns Compression: 26905 ns.
Zero 417 ns Compression: 8022 ns.
Zero 408 ns Compression: 7189 ns.
Zero 400 ns Compression: 7255 ns.
Zero 412 ns Compression: 7016 ns.
Zero 411 ns Compression: 7035 ns.
Zero 413 ns Compression: 6994 ns.
Zero 399 ns Compression: 7024 ns.
Zero 416 ns Compression: 7053 ns.
Zero 405 ns Compression: 7041 ns.

Indeed, zero-detection is faster than compression.

However, while profiling the live_migration thread (with this patch reverted),
we noticed that zero-detection costs a lot of CPU:

  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
   1.96%  kqemu  libc-2.12.so                  [.] memcpy

After this patch, that workload is moved to the worker threads. Is that
acceptable?

> 
>  From compression rate POV of course zero page algo wins since it
> contains no data (but only a flag).
> 

Yes, it does. A compressed zero page is 45 bytes, which is small enough I think.

Hmm, if you do not like that, how about moving zero-page detection to the worker thread?

Thanks!

Daniel P. Berrangé June 28, 2018, 9:36 a.m. UTC | #3

On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:
> 
> Hi Peter,
> 
> Sorry for the delay as i was busy on other things.
> 
> On 06/19/2018 03:30 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Detecting zero page is not a light work, we can disable it
> > > for compression that can handle all zero data very well
> > 
> > Is there any number shows how the compression algo performs better
> > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> 
> This is the comparison between zero-detection and compression (the target
> buffer is all zero bit):
> 
> Zero 810 ns Compression: 26905 ns.
> Zero 417 ns Compression: 8022 ns.
> Zero 408 ns Compression: 7189 ns.
> Zero 400 ns Compression: 7255 ns.
> Zero 412 ns Compression: 7016 ns.
> Zero 411 ns Compression: 7035 ns.
> Zero 413 ns Compression: 6994 ns.
> Zero 399 ns Compression: 7024 ns.
> Zero 416 ns Compression: 7053 ns.
> Zero 405 ns Compression: 7041 ns.
> 
> Indeed, zero-detection is faster than compression.
> 
> However during our profiling for the live_migration thread (after reverted this patch),
> we noticed zero-detection cost lots of CPU:
> 
>  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
>   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> 
> After this patch, the workload is moved to the worker thread, is it
> acceptable?

It depends on your point of view. If you have spare / idle CPUs on the host,
then moving workload to a thread is ok, despite the CPU cost of compression
in that thread being much higher than what was replaced, since you won't be
taking CPU resources away from other contending workloads.

I'd venture to suggest though that we should probably *not* be optimizing for
the case of idle CPUs on the host. More realistic is to expect that the host
CPUs are near fully committed to work, and thus the (default) goal should be
to minimize CPU overhead for the host as a whole. From this POV, zero-page
detection is better than compression due to > x10 better speed.

Given the CPU overheads of compression, I think it has fairly narrow use
in migration in general when considering hosts are often highly committed
on CPU.

Regards,
Daniel

Dr. David Alan Gilbert June 29, 2018, 9:42 a.m. UTC | #4

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> Hi Peter,
> 
> Sorry for the delay as i was busy on other things.
> 
> On 06/19/2018 03:30 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Detecting zero page is not a light work, we can disable it
> > > for compression that can handle all zero data very well
> > 
> > Is there any number shows how the compression algo performs better
> > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> 
> This is the comparison between zero-detection and compression (the target
> buffer is all zero bit):
> 
> Zero 810 ns Compression: 26905 ns.
> Zero 417 ns Compression: 8022 ns.
> Zero 408 ns Compression: 7189 ns.
> Zero 400 ns Compression: 7255 ns.
> Zero 412 ns Compression: 7016 ns.
> Zero 411 ns Compression: 7035 ns.
> Zero 413 ns Compression: 6994 ns.
> Zero 399 ns Compression: 7024 ns.
> Zero 416 ns Compression: 7053 ns.
> Zero 405 ns Compression: 7041 ns.
> 
> Indeed, zero-detection is faster than compression.
> 
> However during our profiling for the live_migration thread (after reverted this patch),
> we noticed zero-detection cost lots of CPU:
> 
>  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2

Interesting; what host are you running on?
Some hosts have support for the faster buffer_zero_sse4/avx2

>   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> 
> After this patch, the workload is moved to the worker thread, is it
> acceptable?
> 
> > 
> >  From compression rate POV of course zero page algo wins since it
> > contains no data (but only a flag).
> > 
> 
> Yes it is. The compressed zero page is 45 bytes that is small enough i think.

So the compression is ~20x slower and 10x the size;  not a great
improvement!
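
As a rough check of the "~20x" figure, the timings quoted above can be averaged directly. The sketch below is only an illustration: the arrays hold the samples from this thread with the first (cold) sample of each series left out, and the helper names mean() and slowdown() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Timings in ns as quoted in this thread; the first, cold sample
 * of each series (810 ns and 26905 ns) is deliberately omitted. */
static const double zero_ns[] = {417, 408, 400, 412, 411, 413, 399, 416, 405};
static const double comp_ns[] = {8022, 7189, 7255, 7016, 7035, 6994, 7024, 7053, 7041};

/* Arithmetic mean of n samples (hypothetical helper). */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* How many times slower compression is than zero detection
 * for an all-zero page, using the warm samples only. */
static double slowdown(void)
{
    return mean(comp_ns, sizeof(comp_ns) / sizeof(comp_ns[0])) /
           mean(zero_ns, sizeof(zero_ns) / sizeof(zero_ns[0]));
}
```

With these samples the warm-state ratio lands between 17x and 18x, so "~20x" is a fair round number; the cold first pair (26905 ns vs 810 ns) is worse still.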

However, the tricky thing is that in the case of a guest which is mostly
non-zero, this patch would save that time used by zero detection, so it
would be faster.

> Hmm, if you do not like, how about move detecting zero page to the work thread?

That would be interesting to try.

Dave

> Thanks!
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Dr. David Alan Gilbert June 29, 2018, 9:54 a.m. UTC | #5

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:
> > 
> > Hi Peter,
> > 
> > Sorry for the delay as i was busy on other things.
> > 
> > On 06/19/2018 03:30 PM, Peter Xu wrote:
> > > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > 
> > > > Detecting zero page is not a light work, we can disable it
> > > > for compression that can handle all zero data very well
> > > 
> > > Is there any number shows how the compression algo performs better
> > > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> > 
> > This is the comparison between zero-detection and compression (the target
> > buffer is all zero bit):
> > 
> > Zero 810 ns Compression: 26905 ns.
> > Zero 417 ns Compression: 8022 ns.
> > Zero 408 ns Compression: 7189 ns.
> > Zero 400 ns Compression: 7255 ns.
> > Zero 412 ns Compression: 7016 ns.
> > Zero 411 ns Compression: 7035 ns.
> > Zero 413 ns Compression: 6994 ns.
> > Zero 399 ns Compression: 7024 ns.
> > Zero 416 ns Compression: 7053 ns.
> > Zero 405 ns Compression: 7041 ns.
> > 
> > Indeed, zero-detection is faster than compression.
> > 
> > However during our profiling for the live_migration thread (after reverted this patch),
> > we noticed zero-detection cost lots of CPU:
> > 
> >  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> >   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
> >   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
> >   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
> >   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
> >   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
> >   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
> >   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
> >   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
> >   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
> >   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
> >   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
> >   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> > 
> > After this patch, the workload is moved to the worker thread, is it
> > acceptable?
> 
> It depends on your point of view. If you have spare / idle CPUs on the host,
> then moving workload to a thread is ok, despite the CPU cost of compression
> in that thread being much higher than what was replaced, since you won't be
> taking CPU resources away from other contending workloads.

It depends on the VM as well; if the VM is mostly non-zero, the zero
checks happen and are overhead (although if the pages are non-zero then
the zero check will mostly finish much faster, unless you're unlucky and
the non-zero byte is the last one on the page).
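
To illustrate the early-exit behaviour described above, here is a minimal scalar sketch. It is not the code in util/bufferiszero.c (which uses SIMD variants); page_is_zero() and its words_examined out-parameter are hypothetical and exist only to make the cost visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the buffer is all zero, 0 otherwise.  The number of
 * 64-bit words that had to be looked at is stored in *words_examined
 * (hypothetical out-parameter, only there to show the early exit). */
static int page_is_zero(const void *page, size_t len, size_t *words_examined)
{
    const unsigned char *p = page;
    size_t n = len / sizeof(uint64_t);

    assert(len % sizeof(uint64_t) == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t w;

        /* memcpy avoids alignment and aliasing assumptions. */
        memcpy(&w, p + i * sizeof(w), sizeof(w));
        if (w != 0) {
            if (words_examined) {
                *words_examined = i + 1;   /* stopped at first non-zero word */
            }
            return 0;
        }
    }
    if (words_examined) {
        *words_examined = n;               /* had to scan the whole page */
    }
    return 1;
}
```

For a 4096-byte page, a non-zero byte in the first word stops the scan after one word, while an all-zero page, or one whose only non-zero byte is the last, costs the full 512 words.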

> I'd venture to suggest though that we should probably *not* be optimizing for
> the case of idle CPUs on the host. More realistic is to expect that the host
> CPUs are near fully committed to work, and thus the (default) goal should be
> to minimize CPU overhead for the host as a whole. From this POV, zero-page
> detection is better than compression due to > x10 better speed.

Note that this is only happening if compression is enabled.

> Given the CPU overheads of compression, I think it has fairly narrow use
> in migration in general when considering hosts are often highly committed
> on CPU.

Also, this compression series was originally written by Intel for the
case where there's a compression accelerator hardware (that I've never
found to try); in that case I guess it saves that CPU overhead.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Xiao Guangrong July 3, 2018, 3:53 a.m. UTC | #6

On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>> Hi Peter,
>>
>> Sorry for the delay as i was busy on other things.
>>
>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Detecting zero page is not a light work, we can disable it
>>>> for compression that can handle all zero data very well
>>>
>>> Is there any number shows how the compression algo performs better
>>> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>
>> This is the comparison between zero-detection and compression (the target
>> buffer is all zero bit):
>>
>> Zero 810 ns Compression: 26905 ns.
>> Zero 417 ns Compression: 8022 ns.
>> Zero 408 ns Compression: 7189 ns.
>> Zero 400 ns Compression: 7255 ns.
>> Zero 412 ns Compression: 7016 ns.
>> Zero 411 ns Compression: 7035 ns.
>> Zero 413 ns Compression: 6994 ns.
>> Zero 399 ns Compression: 7024 ns.
>> Zero 416 ns Compression: 7053 ns.
>> Zero 405 ns Compression: 7041 ns.
>>
>> Indeed, zero-detection is faster than compression.
>>
>> However during our profiling for the live_migration thread (after reverted this patch),
>> we noticed zero-detection cost lots of CPU:
>>
>>   12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> 
> Interesting; what host are you running on?
> Some hosts have support for the faster buffer_zero_sse4/avx2

The host is:

model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
...
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
  mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
  cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
  hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
  clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
  cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke

I checked and noticed that "CONFIG_AVX2_OPT" has not been enabled, maybe because the glibc/gcc
versions are too old:
    gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
    glibc.x86_64                     2.12
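
The CPU flags above list avx2, so the limit here is the toolchain rather than the hardware. As a rough sketch of the runtime half of such a selection (this is not QEMU's init_accel(); the enum and both function names are made up for illustration), GCC 4.8 and later and clang provide __builtin_cpu_supports() on x86, which the gcc 4.4.6 shown above predates:

```c
#include <assert.h>

/* Hypothetical labels for the zero-check variants named in this thread. */
enum zero_variant { ZV_GENERIC = 0, ZV_SSE2, ZV_SSE4, ZV_AVX2 };

/* Pick the widest variant the running CPU reports.  Falls back to
 * ZV_GENERIC on non-x86 targets or compilers without the builtins. */
static enum zero_variant pick_zero_variant(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        return ZV_AVX2;
    }
    if (__builtin_cpu_supports("sse4.1")) {
        return ZV_SSE4;
    }
    if (__builtin_cpu_supports("sse2")) {
        return ZV_SSE2;
    }
#endif
    return ZV_GENERIC;
}

/* Human-readable name for a variant. */
static const char *variant_name(enum zero_variant v)
{
    static const char *const names[] = {"generic", "sse2", "sse4", "avx2"};

    assert((int)v >= (int)ZV_GENERIC && (int)v <= (int)ZV_AVX2);
    return names[v];
}
```

A runtime check like this only helps if the AVX2 code path was compiled in at all, which is what a build-time switch such as CONFIG_AVX2_OPT decides.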


> 
>>    7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>>    6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>>    5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>>    5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>>    4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>>    4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>>    3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>>    2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>>    2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>>    2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>>    2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>>    1.96%  kqemu  libc-2.12.so                  [.] memcpy
>>
>> After this patch, the workload is moved to the worker thread, is it
>> acceptable?
>>
>>>
>>>   From compression rate POV of course zero page algo wins since it
>>> contains no data (but only a flag).
>>>
>>
>> Yes it is. The compressed zero page is 45 bytes that is small enough i think.
> 
> So the compression is ~20x slow and 10x the size;  not a great
> improvement!
> 
> However, the tricky thing is that in the case of a guest which is mostly
> non-zero, this patch would save that time used by zero detection, so it
> would be faster.

Yes, indeed.

> 
>> Hmm, if you do not like, how about move detecting zero page to the work thread?
> 
> That would be interesting to try.
> 

Okay, I will try it then. :)
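
One possible shape for that experiment, as a sketch only: the migration thread queues every page without looking at it, and the worker decides whether the page is zero before falling back to compression. struct page_request, enum page_result and worker_handle_request() are hypothetical names, not taken from the patch series, and the queueing and compression themselves are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum page_result { PAGE_ZERO, PAGE_NEEDS_COMPRESSION };

/* Hypothetical request handed from the migration thread to a worker. */
struct page_request {
    const unsigned char *page;
    size_t len;
    enum page_result result;    /* filled in by the worker */
};

/* Runs in the worker thread, so the zero check no longer costs
 * CPU time on the migration thread. */
static void worker_handle_request(struct page_request *req)
{
    assert(req != NULL && req->page != NULL && req->len > 0);

    /* A buffer is all zero iff its first byte is 0 and every byte
     * equals its successor; memcmp only reads, so overlap is fine. */
    if (req->page[0] == 0 &&
        memcmp(req->page, req->page + 1, req->len - 1) == 0) {
        req->result = PAGE_ZERO;               /* send only a flag */
    } else {
        req->result = PAGE_NEEDS_COMPRESSION;  /* hand to the compressor */
    }
}
```

This keeps the small zero-page wire format while still taking the check off the migration thread, at the price of queueing zero pages that previously never reached a worker.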

Dr. David Alan Gilbert July 16, 2018, 6:58 p.m. UTC | #7

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> > * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> > > 
> > > Hi Peter,
> > > 
> > > Sorry for the delay as i was busy on other things.
> > > 
> > > On 06/19/2018 03:30 PM, Peter Xu wrote:
> > > > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > Detecting zero page is not a light work, we can disable it
> > > > > for compression that can handle all zero data very well
> > > > 
> > > > Is there any number shows how the compression algo performs better
> > > > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > > > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> > > 
> > > This is the comparison between zero-detection and compression (the target
> > > buffer is all zero bit):
> > > 
> > > Zero 810 ns Compression: 26905 ns.
> > > Zero 417 ns Compression: 8022 ns.
> > > Zero 408 ns Compression: 7189 ns.
> > > Zero 400 ns Compression: 7255 ns.
> > > Zero 412 ns Compression: 7016 ns.
> > > Zero 411 ns Compression: 7035 ns.
> > > Zero 413 ns Compression: 6994 ns.
> > > Zero 399 ns Compression: 7024 ns.
> > > Zero 416 ns Compression: 7053 ns.
> > > Zero 405 ns Compression: 7041 ns.
> > > 
> > > Indeed, zero-detection is faster than compression.
> > > 
> > > However during our profiling for the live_migration thread (after reverted this patch),
> > > we noticed zero-detection cost lots of CPU:
> > > 
> > >   12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> > 
> > Interesting; what host are you running on?
> > Some hosts have support for the faster buffer_zero_sse4/avx2
> 
> The host is:
> 
> model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
> ...
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
>  mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
>  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
>  cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
>  hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
>  clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>  cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
> 
> I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe it is due to the old glibc/gcc
> version:
>    gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
>    glibc.x86_64                     2.12

Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.

> 
> > 
> > >    7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
> > >    6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
> > >    5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
> > >    5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
> > >    4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
> > >    4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
> > >    3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
> > >    2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
> > >    2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
> > >    2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
> > >    2.25%  kqemu  qemu-system-x86_64            [.] ring_get
> > >    1.96%  kqemu  libc-2.12.so                  [.] memcpy
> > > 
> > > After this patch, the workload is moved to the worker thread, is it
> > > acceptable?
> > > 
> > > > 
> > > >   From compression rate POV of course zero page algo wins since it
> > > > contains no data (but only a flag).
> > > > 
> > > 
> > > Yes it is. The compressed zero page is 45 bytes, which is small enough, I think.
> > 
> > So the compression is ~20x slower and 10x the size;  not a great
> > improvement!
> > 
> > However, the tricky thing is that in the case of a guest which is mostly
> > non-zero, this patch would save that time used by zero detection, so it
> > would be faster.
> 
> Yes, indeed.

It would be good to benchmark the performance difference for a guest
with mostly non-zero pages; you should see a useful improvement.

Dave

> > 
> > > Hmm, if you do not like it, how about moving zero-page detection to the worker thread?
> > 
> > That would be interesting to try.
> > 
> 
> Okay, i will try it then. :)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
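
As a rough cross-check of the "~20x" figure, the ten timing samples quoted above can be reduced to medians, which damps the cold first run (810 ns / 26905 ns). This is only an illustrative Python sketch; the variable and function names are made up here and nothing in it comes from the patch.

```python
from statistics import median

# Per-page timings in nanoseconds, copied from the table quoted above.
zero_ns = [810, 417, 408, 400, 412, 411, 413, 399, 416, 405]
comp_ns = [26905, 8022, 7189, 7255, 7016, 7035, 6994, 7024, 7053, 7041]


def slowdown(fast, slow):
    """Ratio of median timings: how many times slower `slow` is than `fast`."""
    return median(slow) / median(fast)


if __name__ == "__main__":
    print(f"median zero-detect: {median(zero_ns)} ns")
    print(f"median compression: {median(comp_ns)} ns")
    print(f"compression is about {slowdown(zero_ns, comp_ns):.1f}x slower")
```

On these samples the medians are 411.5 ns and 7047 ns, so compression comes out roughly 17x slower per all-zero page, in line with the ~20x estimate.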

Xiao Guangrong July 18, 2018, 8:46 a.m. UTC | #8

On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
>>> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> Sorry for the delay as i was busy on other things.
>>>>
>>>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>>
>>>>>> Detecting zero page is not a light work, we can disable it
>>>>>> for compression that can handle all zero data very well
>>>>>
>>>>> Is there any number shows how the compression algo performs better
>>>>> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
>>>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>>>
>>>> This is the comparison between zero-detection and compression (the target
>>>> buffer is all zero bit):
>>>>
>>>> Zero 810 ns Compression: 26905 ns.
>>>> Zero 417 ns Compression: 8022 ns.
>>>> Zero 408 ns Compression: 7189 ns.
>>>> Zero 400 ns Compression: 7255 ns.
>>>> Zero 412 ns Compression: 7016 ns.
>>>> Zero 411 ns Compression: 7035 ns.
>>>> Zero 413 ns Compression: 6994 ns.
>>>> Zero 399 ns Compression: 7024 ns.
>>>> Zero 416 ns Compression: 7053 ns.
>>>> Zero 405 ns Compression: 7041 ns.
>>>>
>>>> Indeed, zero-detection is faster than compression.
>>>>
>>>> However during our profiling for the live_migration thread (after reverted this patch),
>>>> we noticed zero-detection cost lots of CPU:
>>>>
>>>>    12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
>>>
>>> Interesting; what host are you running on?
>>> Some hosts have support for the faster buffer_zero_sse4/avx2
>>
>> The host is:
>>
>> model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
>> ...
>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
>>   mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>>   rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
>>   ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>>   tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
>>   cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
>>   hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
>>   clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>>   cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
>>
>> I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe it is due to the old glibc/gcc
>> version:
>>     gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
>>     glibc.x86_64                     2.12
> 
> Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.

Er, it is not easy to update glibc in the production env.... :(
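
The flags quoted above show that the CPU itself advertises avx2; what is missing is build-time support. To make that distinction concrete, here is a small hypothetical Python helper (not QEMU code) that picks the widest of the three variants named in this thread from a /proc/cpuinfo flags line. The mapping from flag names to variants is an assumption made for illustration, and the sample line is an abbreviated subset of the flags quoted above.

```python
# Hypothetical helper, not QEMU code: map /proc/cpuinfo flags to the
# widest zero-check variant mentioned in this thread.
# Assumed preference order, widest first.
VARIANTS = [
    ("avx2", "buffer_zero_avx2"),
    ("sse4_1", "buffer_zero_sse4"),
    ("sse2", "buffer_zero_sse2"),
]


def widest_variant(flags_line):
    """Return the widest variant the CPU could run, or None."""
    # Accept both "flags : a b c" and a bare "a b c" string.
    flags = set(flags_line.split(":", 1)[-1].split())
    for flag, variant in VARIANTS:
        if flag in flags:
            return variant
    return None


# Abbreviated subset of the flags line quoted in this thread.
SAMPLE = "flags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx avx2 avx512f"

if __name__ == "__main__":
    print(widest_variant(SAMPLE))
```

Note the limit of such a check: it only says what the CPU could run. Whether a given binary contains the AVX2 routine is decided when it is built, which is why the profile above still shows buffer_zero_sse2 on an AVX2-capable host.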

Michael S. Tsirkin July 22, 2018, 4:05 p.m. UTC | #9

On Wed, Jul 18, 2018 at 04:46:21PM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:
> > * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> > > 
> > > 
> > > On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> > > > * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> > > > > 
> > > > > Hi Peter,
> > > > > 
> > > > > Sorry for the delay as i was busy on other things.
> > > > > 
> > > > > On 06/19/2018 03:30 PM, Peter Xu wrote:
> > > > > > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > > > 
> > > > > > > Detecting zero page is not a light work, we can disable it
> > > > > > > for compression that can handle all zero data very well
> > > > > > 
> > > > > > Is there any number shows how the compression algo performs better
> > > > > > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > > > > > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> > > > > 
> > > > > This is the comparison between zero-detection and compression (the target
> > > > > buffer is all zero bit):
> > > > > 
> > > > > Zero 810 ns Compression: 26905 ns.
> > > > > Zero 417 ns Compression: 8022 ns.
> > > > > Zero 408 ns Compression: 7189 ns.
> > > > > Zero 400 ns Compression: 7255 ns.
> > > > > Zero 412 ns Compression: 7016 ns.
> > > > > Zero 411 ns Compression: 7035 ns.
> > > > > Zero 413 ns Compression: 6994 ns.
> > > > > Zero 399 ns Compression: 7024 ns.
> > > > > Zero 416 ns Compression: 7053 ns.
> > > > > Zero 405 ns Compression: 7041 ns.
> > > > > 
> > > > > Indeed, zero-detection is faster than compression.
> > > > > 
> > > > > However during our profiling for the live_migration thread (after reverted this patch),
> > > > > we noticed zero-detection cost lots of CPU:
> > > > > 
> > > > >    12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2                                                                                                                                                                           ◆
> > > > 
> > > > Interesting; what host are you running on?
> > > > Some hosts have support for the faster buffer_zero_ss4/avx2
> > > 
> > > The host is:
> > > 
> > > model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
> > > ...
> > > flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
> > >   mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> > >   rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
> > >   ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> > >   tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
> > >   cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
> > >   hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
> > >   clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
> > >   cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
> > > 
> > > > I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe it is due to the old glibc/gcc
> > > version:
> > >     gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
> > >     glibc.x86_64                     2.12
> > 
> > Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.
> 
> Er, it is not easy to update glibc in the production env.... :(

But neither is QEMU updated in production all that easily. While we do
want to support older hosts functionally, it does not make
much sense to develop complex optimizations that only benefit
older hosts.

Xiao Guangrong July 23, 2018, 7:12 a.m. UTC | #10

On 07/23/2018 12:05 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2018 at 04:46:21PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:
>>> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>>>
>>>>
>>>> On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
>>>>> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> Sorry for the delay as i was busy on other things.
>>>>>>
>>>>>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>>>>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>>>>
>>>>>>>> Detecting zero page is not a light work, we can disable it
>>>>>>>> for compression that can handle all zero data very well
>>>>>>>
>>>>>>> Is there any number shows how the compression algo performs better
>>>>>>> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
>>>>>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>>>>>
>>>>>> This is the comparison between zero-detection and compression (the target
>>>>>> buffer is all zero bit):
>>>>>>
>>>>>> Zero 810 ns Compression: 26905 ns.
>>>>>> Zero 417 ns Compression: 8022 ns.
>>>>>> Zero 408 ns Compression: 7189 ns.
>>>>>> Zero 400 ns Compression: 7255 ns.
>>>>>> Zero 412 ns Compression: 7016 ns.
>>>>>> Zero 411 ns Compression: 7035 ns.
>>>>>> Zero 413 ns Compression: 6994 ns.
>>>>>> Zero 399 ns Compression: 7024 ns.
>>>>>> Zero 416 ns Compression: 7053 ns.
>>>>>> Zero 405 ns Compression: 7041 ns.
>>>>>>
>>>>>> Indeed, zero-detection is faster than compression.
>>>>>>
>>>>>> However during our profiling for the live_migration thread (after reverted this patch),
>>>>>> we noticed zero-detection cost lots of CPU:
>>>>>>
>>>>>>     12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
>>>>>
>>>>> Interesting; what host are you running on?
>>>>> Some hosts have support for the faster buffer_zero_sse4/avx2
>>>>
>>>> The host is:
>>>>
>>>> model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
>>>> ...
>>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
>>>>    mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>>>>    rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
>>>>    ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>>>>    tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
>>>>    cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
>>>>    hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
>>>>    clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>>>>    cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
>>>>
>>>> I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe it is due to the old glibc/gcc
>>>> version:
>>>>      gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
>>>>      glibc.x86_64                     2.12
>>>
>>> Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.
>>
>> Er, it is not easy to update glibc in the production env.... :(
> 
> But neither is QEMU updated in production all that easily. While we do
> want to support older hosts functionally, it does not make
> much sense to devel complex optimizations that only benefit
> older hosts.
> 

I could not agree with you more. :)

So I benchmarked it on a production host with a newer distribution installed.
Here is the data:
  27.48%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
  12.63%  kqemu  [kernel.kallsyms]             [k] copy_page_rep
  10.82%  kqemu  qemu-system-x86_64            [.] buffer_zero_avx2
   5.69%  kqemu  [kernel.kallsyms]             [k] native_queued_spin_lock_slowpath
   4.61%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
   4.39%  kqemu  qemu-system-x86_64            [.] qemu_event_set
   4.12%  kqemu  qemu-system-x86_64            [.] ram_find_and_save_block.part.24
   3.61%  kqemu  [kernel.kallsyms]             [k] tcp_sendmsg
   2.62%  kqemu  libc-2.17.so                  [.] __memcpy_ssse3_back
   1.89%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
   1.32%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done

It does not help: buffer_zero_avx2 still accounts for 10.82% of the CPU time in this profile.
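
For anyone wanting to reproduce the per-page comparison quoted earlier in this thread outside QEMU, a minimal Python sketch follows. It is not the code that produced the numbers above: the 4096-byte page size, the zlib level, and all function names are assumptions of this sketch, and interpreter overhead means the absolute timings will not match the C figures. The raw zlib stream also need not be exactly the 45 bytes mentioned in the thread.

```python
import time
import zlib

PAGE_SIZE = 4096  # assumed guest page size


def is_zero_page(page):
    """Plain zero check; stands in for an accelerated buffer_is_zero()."""
    return page.count(0) == len(page)


def compress_page(page, level=1):
    """zlib-compress one page; the level is an assumption of this sketch."""
    return zlib.compress(page, level)


def time_ns(fn, arg, rounds=1000):
    """Average wall-clock nanoseconds per call over `rounds` calls."""
    start = time.perf_counter_ns()
    for _ in range(rounds):
        fn(arg)
    return (time.perf_counter_ns() - start) / rounds


if __name__ == "__main__":
    zero = bytes(PAGE_SIZE)
    print("zero check :", time_ns(is_zero_page, zero), "ns")
    print("compression:", time_ns(compress_page, zero), "ns")
    print("compressed size:", len(compress_page(zero)), "bytes")
```

Running the same helpers on a page of random bytes (for example `os.urandom(PAGE_SIZE)`) gives the mostly non-zero case that was suggested as the more interesting benchmark.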
