[10/12] migration: introduce lockless multithreads model

From: Xiao Guangrong <xiaoguangrong@tencent.com>

The current implementation of compression and decompression is very
hard to enable in production. We noticed that too many wait/wake
operations go to kernel space, and CPU usage is very low even when
the system is otherwise idle.

The reasons are:
1) There are too many locks used for synchronization: there is a
   global lock and each thread has its own lock, so the migration
   thread and the worker threads have to go to sleep whenever these
   locks are busy.

2) The migration thread submits requests to each thread separately,
   but only one request can be pending per thread, which means the
   thread has to go to sleep after finishing each request.

To make it work better, we introduce a new multithread model. The
user, currently the migration thread, submits requests to the
threads in a round-robin manner. Each thread has its own request
ring with a capacity of 4 and puts its results into a global ring
that is lockless for multiple producers. The user fetches results
from the global ring and performs the remaining operations for each
request, e.g. posting the compressed data out for migration on the
source QEMU.
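
As a rough illustration of the submit path described above, here is a
minimal single-threaded sketch in C. It is not the code added in
migration/threads.c: the names RequestRing, Submitter, ring_put,
ring_get and submit_request are hypothetical, NR_THREADS is an
arbitrary value chosen for the sketch, requests are plain ints, and
the atomics and memory barriers a real lockless ring needs are left
out. Skipping a full ring and trying the next one is also an
assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_CAPACITY 4   /* per-thread ring capacity, as stated above */
#define NR_THREADS    3   /* hypothetical worker count for this sketch */

/* Index wrap-around below relies on the capacity being a power of two. */
static_assert((RING_CAPACITY & (RING_CAPACITY - 1)) == 0,
              "RING_CAPACITY must be a power of two");

/* Hypothetical per-thread request ring: one submitter, one worker. */
typedef struct {
    int slots[RING_CAPACITY];
    unsigned int head;    /* next slot to fill, advanced by the submitter */
    unsigned int tail;    /* next slot to drain, advanced by the worker */
} RequestRing;

/* Hypothetical submitter state: one ring per worker plus a cursor. */
typedef struct {
    RequestRing rings[NR_THREADS];
    unsigned int next;    /* round-robin cursor */
} Submitter;

/* Queue a request; returns false if the ring already holds 4 requests. */
static bool ring_put(RequestRing *r, int req)
{
    if (r->head - r->tail == RING_CAPACITY) {
        return false;
    }
    r->slots[r->head % RING_CAPACITY] = req;
    r->head++;
    return true;
}

/* Take the oldest request; returns false if the ring is empty. */
static bool ring_get(RequestRing *r, int *req)
{
    if (r->head == r->tail) {
        return false;
    }
    *req = r->slots[r->tail % RING_CAPACITY];
    r->tail++;
    return true;
}

/*
 * Submit a request round-robin, starting at the cursor and skipping
 * rings that are full. Returns the index of the thread that got the
 * request, or -1 if every ring is full.
 */
static int submit_request(Submitter *s, int req)
{
    for (unsigned int i = 0; i < NR_THREADS; i++) {
        unsigned int idx = (s->next + i) % NR_THREADS;

        if (ring_put(&s->rings[idx], req)) {
            s->next = (idx + 1) % NR_THREADS;
            return (int)idx;
        }
    }
    return -1;
}
```

Because each ring can hold several pending requests, a worker in this
sketch can pick up its next request right after finishing one instead
of going to sleep, which is the point of reason 2) above. A return
value of -1 means all rings are full; at that point the caller would
typically drain results before submitting again.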

Performance Result:
The test was based on top of the patch:
   ring: introduce lockless ring buffer
which means the previous optimizations are used both in the original
case and with the new multithread model applied.

We tested live migration between two hosts, each with:
   Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
migrating a VM with 16 vCPUs and 60G memory from one host to the
other. During the migration, multiple threads in the VM repeatedly
write to its memory.

We used 16 threads on the destination to decompress the data; on the
source, we tried both 8 threads and 16 threads to compress the data.

--- Before our work ---
Migration cannot finish with either 8 threads or 16 threads. The data
is as follows:

Use 8 threads to compress:
- on the source:
            migration thread   compress-threads
CPU usage       70%            some use 36%, others are very low ~20%
- on the destination:
            main thread        decompress-threads
CPU usage       100%           some use ~40%, others are very low ~2%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1019540 milliseconds
expected downtime: 2263 milliseconds
setup: 218 milliseconds
transferred ram: 252419995 kbytes
throughput: 2469.45 mbps
remaining ram: 15611332 kbytes
total ram: 62931784 kbytes
duplicate: 915323 pages
skipped: 0 pages
normal: 59673047 pages
normal bytes: 238692188 kbytes
dirty sync count: 28
page size: 4 kbytes
dirty pages rate: 170551 pages
compression pages: 121309323 pages
compression busy: 60588337
compression busy rate: 0.36
compression reduced size: 484281967178
compression rate: 0.97

Use 16 threads to compress:
- on the source:
            migration thread   compress-threads
CPU usage       96%            some use 45%, others are very low ~6%
- on the destination:
            main thread        decompress-threads
CPU usage       96%            some use 58%, others are very low ~10%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1189221 milliseconds
expected downtime: 6824 milliseconds
setup: 220 milliseconds
transferred ram: 90620052 kbytes
throughput: 840.41 mbps
remaining ram: 3678760 kbytes
total ram: 62931784 kbytes
duplicate: 195893 pages
skipped: 0 pages
normal: 17290715 pages
normal bytes: 69162860 kbytes
dirty sync count: 33
page size: 4 kbytes
dirty pages rate: 175039 pages
compression pages: 186739419 pages
compression busy: 17486568
compression busy rate: 0.09
compression reduced size: 744546683892
compression rate: 0.97

--- After our work ---
Migration finishes quickly with both 8 threads and 16 threads. The
data is as follows:

Use 8 threads to compress:
- on the source:
            migration thread   compress-threads
CPU usage       30%            30% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%           50% (all threads have same CPU usage)

Migration status (finished in 219467 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 219467 milliseconds
downtime: 115 milliseconds
setup: 222 milliseconds
transferred ram: 88510173 kbytes
throughput: 3303.81 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 2211775 pages
skipped: 0 pages
normal: 21166222 pages
normal bytes: 84664888 kbytes
dirty sync count: 15
page size: 4 kbytes
compression pages: 32045857 pages
compression busy: 23377968
compression busy rate: 0.34
compression reduced size: 127767894329
compression rate: 0.97

Use 16 threads to compress:
- on the source:
            migration thread   compress-threads
CPU usage       60%            60% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%           75% (all threads have same CPU usage)

Migration status (finished in 64118 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 64118 milliseconds
downtime: 29 milliseconds
setup: 223 milliseconds
transferred ram: 13345135 kbytes
throughput: 1705.10 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 574921 pages
skipped: 0 pages
normal: 2570281 pages
normal bytes: 10281124 kbytes
dirty sync count: 9
page size: 4 kbytes
compression pages: 28007024 pages
compression busy: 3145182
compression busy rate: 0.08
compression reduced size: 111829024985
compression rate: 0.97

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/Makefile.objs |   1 +
 migration/threads.c     | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/threads.h     | 116 +++++++++++++++++++++
 3 files changed, 382 insertions(+)
 create mode 100644 migration/threads.c
 create mode 100644 migration/threads.h
