[v3,0/4] Live Migration Acceleration with IAA Compression

Message ID 20240103112851.908082-1-yuan1.liu@intel.com (mailing list archive)

Message

Yuan Liu Jan. 3, 2024, 11:28 a.m. UTC
Hi,

I am writing to submit a code change aimed at enhancing live migration
acceleration by leveraging the compression capability of the Intel
In-Memory Analytics Accelerator (IAA).

The implementation of the IAA (de)compression code is based on Intel Query
Processing Library (QPL), an open-source software project designed for
IAA high-level software programming. https://github.com/intel/qpl

In the last version, there was some discussion about whether to
introduce a new compression algorithm for IAA. Because the compression
algorithm of the IAA hardware is based on deflate, and QPL already
supports Zlib, in this version I implemented IAA as an accelerator for
the Zlib compression method. However, for several reasons, QPL is
currently not compatible with the existing Zlib method: Zlib-compressed
data cannot be decompressed by QPL, and vice versa.

I have some questions about the existing Zlib compression:
  1. Would you consider letting one channel support multi-stream
     compression? This may reduce the compression ratio, but it
     allows the hardware to process the streams concurrently. Each
     stream can cover multiple pages, which limits the loss of
     compression ratio. For example, 128 pages could be divided into
     16 streams that are compressed independently. I will provide
     early performance data in the next version (v4).

  2. Would you consider using QPL/IAA as an independent compression
     algorithm instead of an accelerator? This way we can make better
     use of the hardware's performance and features, such as IAA's
     canned mode, in which a Huffman table is generated dynamically
     from statistics of the data to improve the compression ratio.
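
As a rough illustration of the split mentioned in point 1 (128 pages
divided into 16 streams), the sketch below computes which pages each
stream would cover. The names `stream_span` and `stream_pages` are
hypothetical and not part of this series; only the partitioning
arithmetic is shown, not the compression itself.

```c
#include <stddef.h>

/* Hypothetical helper type: half-open page range [first, first + count). */
struct stream_span {
    size_t first;
    size_t count;
};

/*
 * Split n_pages across n_streams as evenly as possible.
 * The first (n_pages % n_streams) streams receive one extra page.
 * The caller must pass n_streams > 0 and idx < n_streams.
 */
static struct stream_span stream_pages(size_t n_pages, size_t n_streams,
                                       size_t idx)
{
    size_t base = n_pages / n_streams;
    size_t extra = n_pages % n_streams;
    struct stream_span s;

    s.count = base + (idx < extra ? 1 : 0);
    s.first = idx * base + (idx < extra ? idx : extra);
    return s;
}
```

With 128 pages and 16 streams, every stream covers 8 consecutive pages,
so stream 15 starts at page 120.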

Test conditions:
  1. Host CPUs are based on Sapphire Rapids, with the frequency locked
     to 3.4 GHz
  2. VM type: 16 vCPUs and 64G memory
  3. The Idle workload means no workload is running in the VM
  4. The Redis workload means YCSB workloadb + Redis Server are running
     in the VM; about 20G or more memory will be used.
  5. Source side migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter downtime-limit 300
     d. migrate_set_parameter multifd-compression zlib
     e. migrate_set_parameter multifd-compression-accel none/qpl
     f. migrate_set_parameter max-bandwidth 100G
  6. Destination side migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter multifd-compression zlib
     d. migrate_set_parameter multifd-compression-accel none/qpl
     e. migrate_set_parameter max-bandwidth 100G

Early migration result, each result is the average of three tests
 +--------+-------------+--------+--------+---------+----------+
 |        | The number  |total   |downtime|network  |pages per |
 |        | of channels |time(ms)|(ms)    |bandwidth|second    |
 |        | and mode    |        |        |(mbps)   |          |
 |        +-------------+--------+--------+---------+----------+
 |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
 |        +-------------+--------+--------+---------+----------+
 | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
 |workload+-------------+--------+--------+---------+----------+
 |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
 |        +-------------+--------+--------+---------+----------+
 |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
 +--------+-------------+--------+--------+---------+----------+

 +--------+-------------+--------+--------+---------+----------+
 |        | The number  |total   |downtime|network  |pages per |
 |        | of channels |time(ms)|(ms)    |bandwidth|second    |
 |        | and mode    |        |        |(mbps)   |          |
 |        +-------------+--------+--------+---------+----------+
 |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
 |        +-------------+--------+--------+---------+----------+
 | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
 |workload+-------------+--------+--------+---------+----------+
 |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
 |        +-------------+--------+--------+---------+----------+
 |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
 +--------+-------------+--------+--------+---------+----------+

v2:
  - add support for multifd compression accelerator
  - add support for the QPL accelerator in the multifd
    compression accelerator
  - fixed the issue that QPL was compiled into the migration
    module by default

v3:
  - use Meson instead of pkg-config to resolve QPL build
    dependency issue
  - fix coding style
  - fix a CI issue for get_multifd_ops function in multifd.c file

Yuan Liu (4):
  migration: Introduce multifd-compression-accel parameter
  multifd: Implement multifd compression accelerator
  configure: add qpl option
  multifd: Introduce QPL compression accelerator

 hw/core/qdev-properties-system.c    |  11 +
 include/hw/qdev-properties-system.h |   4 +
 meson.build                         |  18 ++
 meson_options.txt                   |   2 +
 migration/meson.build               |   1 +
 migration/migration-hmp-cmds.c      |  10 +
 migration/multifd-qpl.c             | 323 ++++++++++++++++++++++++++++
 migration/multifd.c                 |  40 +++-
 migration/multifd.h                 |   8 +
 migration/options.c                 |  28 +++
 migration/options.h                 |   1 +
 qapi/migration.json                 |  31 ++-
 scripts/meson-buildoptions.sh       |   3 +
 13 files changed, 477 insertions(+), 3 deletions(-)
 create mode 100644 migration/multifd-qpl.c

Comments

Peter Xu Jan. 29, 2024, 10:42 a.m. UTC | #1
On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> Hi,

Hi, Yuan,

I have a few comments and questions.  Many of them may be pure questions,
as I don't know enough about these new technologies.

> 
> I am writing to submit a code change aimed at enhancing live migration
> acceleration by leveraging the compression capability of the Intel
> In-Memory Analytics Accelerator (IAA).
> 
> The implementation of the IAA (de)compression code is based on Intel Query
> Processing Library (QPL), an open-source software project designed for
> IAA high-level software programming. https://github.com/intel/qpl
> 
> In the last version, there was some discussion about whether to
> introduce a new compression algorithm for IAA. Because the compression
> algorithm of IAA hardware is based on deflate, and QPL already supports
> Zlib, so in this version, I implemented IAA as an accelerator for the
> Zlib compression method. However, due to some reasons, QPL is currently
> not compatible with the existing Zlib method that Zlib compressed data
> can be decompressed by QPl and vice versa.
> 
> I have some concerns about the existing Zlib compression
>   1. Will you consider supporting one channel to support multi-stream
>      compression? Of course, this may lead to a reduction in compression
>      ratio, but it will allow the hardware to process each stream 
>      concurrently. We can have each stream process multiple pages,
>      reducing the loss of compression ratio. For example, 128 pages are
>      divided into 16 streams for independent compression. I will provide
>      the a early performance data in the next version(v4).

I think Juan asked a similar question before: how much can this help if
multifd can already achieve some form of concurrency over the pages?
Couldn't the user specify more multifd channels if they want to grant more
CPU resources for comp/decomp purposes?

IOW, how many concurrent channels can QPL provide?  What is the suggested
number of concurrent channels there?

> 
>   2. Will you consider using QPL/IAA as an independent compression
>      algorithm instead of an accelerator? In this way, we can better
>      utilize hardware performance and some features, such as IAA's
>      canned mode, which can be dynamically generated by some statistics
>      of data. A huffman table to improve the compression ratio.

Maybe one more knob will work?  If it's not compatible with the deflate
algo maybe it should never be the default.  IOW, the accelerators may be
extended into this (based on what you already proposed):

  - auto ("qpl" first, "none" second; never "qpl-optimized")
  - none (old zlib)
  - qpl (qpl, compatible)
  - qpl-optimized (qpl, incompatible)

Then "auto"/"none"/"qpl" will always be compatible; only the last one is
not, and the user can select it explicitly, but only on both sides of QEMU.
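
The selection rule for these four values could be sketched roughly as
follows. This is only an illustration of the proposal as worded here: the
enum and function names are hypothetical, and it assumes that a
zlib-compatible "qpl" mode exists, which the cover letter says is not the
case in v3.

```c
#include <stdbool.h>

/* Hypothetical names mirroring the four proposed values. */
enum accel_mode {
    ACCEL_AUTO,
    ACCEL_NONE,
    ACCEL_QPL,
    ACCEL_QPL_OPTIMIZED,
};

/*
 * "auto" picks "qpl" when the hardware is usable, else "none".
 * It never resolves to "qpl-optimized".
 */
static enum accel_mode accel_resolve(enum accel_mode requested,
                                     bool qpl_available)
{
    if (requested == ACCEL_AUTO) {
        return qpl_available ? ACCEL_QPL : ACCEL_NONE;
    }
    return requested;
}

/*
 * Only the optimized mode changes the stream format, so it has to be
 * selected on both sides or on neither.
 */
static bool accel_wire_compatible(enum accel_mode src, enum accel_mode dst)
{
    bool src_opt = (src == ACCEL_QPL_OPTIMIZED);
    bool dst_opt = (dst == ACCEL_QPL_OPTIMIZED);

    return src_opt == dst_opt;
}
```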

> 
> Test condition:
>   1. Host CPUs are based on Sapphire Rapids, and frequency locked to 3.4G
>   2. VM type, 16 vCPU and 64G memory
>   3. The Idle workload means no workload is running in the VM 
>   4. The Redis workload means YCSB workloadb + Redis Server are running
>      in the VM, about 20G or more memory will be used.
>   5. Source side migartion configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter downtime-limit 300
>      d. migrate_set_parameter multifd-compression zlib
>      e. migrate_set_parameter multifd-compression-accel none/qpl
>      f. migrate_set_parameter max-bandwidth 100G
>   6. Desitination side migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter multifd-compression zlib
>      d. migrate_set_parameter multifd-compression-accel none/qpl
>      e. migrate_set_parameter max-bandwidth 100G

How is zlib-level set up?  Default (1)?

Btw, it seems that neither the zlib nor the zstd level can actually be
configured right now.. probably overlooked in migrate_params_apply().

> 
> Early migration result, each result is the average of three tests
>  +--------+-------------+--------+--------+---------+----+-----+
>  |        | The number  |total   |downtime|network  |pages per |
>  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
>  |        | and mode    |        |        |(mbps)   |          |
>  |        +-------------+-----------------+---------+----------+
>  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
>  |        +-------------+--------+--------+---------+----------+
>  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
>  |workload+-------------+--------+--------+---------+----------+
>  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |

The numbers are slightly confusing to me.  If IAA can send 3x more pages
per second, shouldn't the total migration time be 1/3 of the other if
the guest is idle?  But the total times seem to be pretty close no matter
the number of channels.  Maybe I missed something?
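
To put numbers on this, the two ratios can be computed from the 2-channel
idle rows quoted above. A minimal sketch follows; the struct and function
names are made up for illustration, and the values are copied from the
table.

```c
/* One row of the idle-workload table (values copied from the table). */
struct mig_result {
    double total_ms;
    double pages_per_sec;
};

/* How many times more pages per second `b` moves than `a`. */
static double throughput_gain(struct mig_result a, struct mig_result b)
{
    return b.pages_per_sec / a.pages_per_sec;
}

/* How many times shorter the total migration time of `b` is. */
static double time_gain(struct mig_result a, struct mig_result b)
{
    return a.total_ms / b.total_ms;
}

static const struct mig_result idle_2chl_zlib = { 20647.0, 137767.0 };
static const struct mig_result idle_2chl_iaa  = { 17022.0, 460289.0 };
```

For these two rows the pages-per-second gain is roughly 3.3x while the
total-time gain is only roughly 1.2x, which is the mismatch in question.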

>  +--------+-------------+--------+--------+---------+----------+
> 
>  +--------+-------------+--------+--------+---------+----+-----+
>  |        | The number  |total   |downtime|network  |pages per |
>  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
>  |        | and mode    |        |        |(mbps)   |          |
>  |        +-------------+-----------------+---------+----------+
>  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
>  |        +-------------+--------+--------+---------+----------+
>  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
>  |workload+-------------+--------+--------+---------+----------+
>  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
>  +--------+-------------+--------+--------+---------+----------+

The redis results look much more in favor of IAA compared to the idle
tests.  Does it mean that IAA works less well with zero pages in
general (assuming that'll be the majority in the idle test)?

From the manual, I see that IAA also supports encryption/decryption.  Would
it be able to accelerate TLS?

How should one consider IAA over QAT?  What is the major difference?  I see
that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW is
something attached to the pcie bus (assume QAT the same)?

Thanks,

Yuan Liu Jan. 30, 2024, 3:56 a.m. UTC | #2
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Monday, January 29, 2024 6:43 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
> 
> On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > Hi,
> 
> Hi, Yuan,
> 
> I have a few comments and questions.  Many of them can be pure questions
> as I don't know enough on these new technologies.
> 
> >
> > I am writing to submit a code change aimed at enhancing live migration
> > acceleration by leveraging the compression capability of the Intel
> > In-Memory Analytics Accelerator (IAA).
> >
> > The implementation of the IAA (de)compression code is based on Intel
> > Query Processing Library (QPL), an open-source software project
> > designed for IAA high-level software programming.
> > https://github.com/intel/qpl
> >
> > In the last version, there was some discussion about whether to
> > introduce a new compression algorithm for IAA. Because the compression
> > algorithm of IAA hardware is based on deflate, and QPL already
> > supports Zlib, so in this version, I implemented IAA as an accelerator
> > for the Zlib compression method. However, due to some reasons, QPL is
> > currently not compatible with the existing Zlib method that Zlib
> > compressed data can be decompressed by QPl and vice versa.
> >
> > I have some concerns about the existing Zlib compression
> >   1. Will you consider supporting one channel to support multi-stream
> >      compression? Of course, this may lead to a reduction in compression
> >      ratio, but it will allow the hardware to process each stream
> >      concurrently. We can have each stream process multiple pages,
> >      reducing the loss of compression ratio. For example, 128 pages are
> >      divided into 16 streams for independent compression. I will provide
> >      the a early performance data in the next version(v4).
> 
> I think Juan used to ask similar question: how much this can help if
> multifd can already achieve some form of concurrency over the pages?


> Couldn't the user specify more multifd channels if they want to grant more
> cpu resource for comp/decomp purpose?
> 
> IOW, how many concurrent channels QPL can provide?  What is the suggested
> concurrency channels there?

On the QPL software side, there is no limit on the number of concurrent
compression and decompression tasks.
On the IAA hardware side, one IAA physical device can process two
compression tasks or eight decompression tasks concurrently. There are up
to 8 IAA devices on an Intel SPR server, and the number varies with the
customer’s product selection and deployment.

Regarding the number of concurrent channels required, I think this may
not be a bottleneck. Please allow me to explain in a little more detail.

1. If the compression design is based on the Zlib/Deflate/Gzip streaming
mode, then we do need more channels to keep processing concurrent. Each
time a multifd packet (containing 128 independent pages) is compressed,
its pages are compressed one by one, so these 128 pages are not processed
concurrently. The only concurrency comes from multiple channels, each
handling its own multifd packet.

2. Based on our testing, we prefer concurrent processing at the
granularity of 4K pages rather than multifd packets, which means that the
128 pages belonging to a packet can be compressed/decompressed
concurrently. Even a single channel can then use all the resources of
IAA. But this is not compatible with the existing zlib method.
The code is similar to the following:
  for (int i = 0; i < num_pages; i++) {
    job[i]->input_data = pages[i];
    submit_job(job[i]);  /* non-blocking submit of a comp/decomp task */
  }
  for (int i = 0; i < num_pages; i++) {
    /* Busy polling. In the future, this step and the data sending
     * will be turned into a pipeline. */
    wait_job(job[i]);
  }
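
To show why submitting all pages before waiting matters, here is a small
self-contained simulation of the pattern above. It does not use QPL:
`mock_device`, `mock_submit`, `mock_tick` and `mock_wait` are hypothetical
stand-ins, and the assumption that the device completes a fixed number of
jobs per "tick" is a simplification.

```c
#include <stddef.h>

#define MAX_JOBS 128

enum job_state { JOB_IDLE = 0, JOB_QUEUED, JOB_DONE };

/* Simulated accelerator: finishes at most `engines` queued jobs per tick. */
struct mock_device {
    enum job_state jobs[MAX_JOBS];
    size_t engines;
    size_t ticks;   /* simulated elapsed time */
};

/* Non-blocking submit: only marks the job as queued. */
static void mock_submit(struct mock_device *d, size_t id)
{
    d->jobs[id] = JOB_QUEUED;
}

/* Advance simulated time by one tick. */
static void mock_tick(struct mock_device *d)
{
    size_t finished = 0;

    for (size_t i = 0; i < MAX_JOBS && finished < d->engines; i++) {
        if (d->jobs[i] == JOB_QUEUED) {
            d->jobs[i] = JOB_DONE;
            finished++;
        }
    }
    d->ticks++;
}

/* Busy-poll until the given (already submitted) job has completed. */
static void mock_wait(struct mock_device *d, size_t id)
{
    while (d->jobs[id] != JOB_DONE) {
        mock_tick(d);
    }
}

/* Submit every page first, then wait: the device always has work queued. */
static size_t ticks_batched(size_t n_pages, size_t engines)
{
    struct mock_device d = { .engines = engines };

    if (engines == 0 || n_pages > MAX_JOBS) {
        return 0;   /* unsupported input in this sketch */
    }
    for (size_t i = 0; i < n_pages; i++) {
        mock_submit(&d, i);
    }
    for (size_t i = 0; i < n_pages; i++) {
        mock_wait(&d, i);
    }
    return d.ticks;
}

/* Submit and wait page by page: only one job is ever in flight. */
static size_t ticks_serial(size_t n_pages, size_t engines)
{
    struct mock_device d = { .engines = engines };

    if (engines == 0 || n_pages > MAX_JOBS) {
        return 0;   /* unsupported input in this sketch */
    }
    for (size_t i = 0; i < n_pages; i++) {
        mock_submit(&d, i);
        mock_wait(&d, i);
    }
    return d.ticks;
}
```

With 128 pages and two engines (matching the two concurrent compressions
per IAA device mentioned earlier in this reply), the batched variant needs
64 simulated ticks where the page-by-page variant needs 128.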

3. Currently, the patches we provided to the community are based on
streaming compression, in order to be compatible with the current zlib
method. However, we found that this still has many problems, so in the
next version we plan to provide the independent QPL/IAA acceleration
function described above instead.
Compatibility issues include the following:
    1. QPL currently does not support the Z_SYNC_FLUSH operation.
    2. The IAA comp/decomp window is fixed at 4K, while the default zlib
       window size is 32K, and the window size must be the same on both
       the comp and decomp sides.
    3. I also looked into the QAT compression scheme. QATzip currently
       supports neither zlib nor Z_SYNC_FLUSH. Its window size is 32K.
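
For reference on the window sizes in item 2: zlib expresses the window
through the `windowBits` argument of `deflateInit2()`/`inflateInit2()`,
which is the base-2 logarithm of the window size. The helper below
(hypothetical name, no zlib calls) only shows that mapping. Whether QPL
output would interoperate with zlib configured this way is not something
this sketch establishes.

```c
#include <stddef.h>

/*
 * Return the zlib windowBits value (base-2 logarithm of the window size)
 * for a power-of-two window size, or -1 if the size is not a power of two
 * or falls outside the 256-byte to 32 KiB range (windowBits 8..15).
 */
static int window_bits_for_size(size_t window_size)
{
    int bits = 0;

    if (window_size == 0 || (window_size & (window_size - 1)) != 0) {
        return -1;
    }
    while ((window_size >> bits) != 1) {
        bits++;
    }
    return (bits >= 8 && bits <= 15) ? bits : -1;
}
```

A 32K window corresponds to windowBits 15 and a 4K window to windowBits
12.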

In general, I think it is a good suggestion to make the accelerator
compatible with standard compression algorithms, while also letting the
accelerator run independently, which avoids some of its compatibility and
performance problems. For example, we can add an "accel" option to the
compression method; the user must then specify the same accelerator
through the compression accelerator parameter on both the source and
destination ends (just like specifying the same compression algorithm).

> >
> >   2. Will you consider using QPL/IAA as an independent compression
> >      algorithm instead of an accelerator? In this way, we can better
> >      utilize hardware performance and some features, such as IAA's
> >      canned mode, which can be dynamically generated by some statistics
> >      of data. A huffman table to improve the compression ratio.
> 
> Maybe one more knob will work?  If it's not compatible with the deflate
> algo maybe it should never be the default.  IOW, the accelerators may be
> extended into this (based on what you already proposed):
> 
>   - auto ("qpl" first, "none" second; never "qpl-optimized")
>   - none (old zlib)
>   - qpl (qpl compatible)
>   - qpl-optimized (qpl uncompatible)
> 
> Then "auto"/"none"/"qpl" will always be compatible, only the last doesn't,
> user can select it explicit, but only on both sides of QEMU.
Yes, this is what I want: I need a mode in which QPL does not have to be
compatible with zlib. From my current point of view, if zlib uses raw
deflate mode, then QAT will be compatible with the current community zlib
solution.
So my suggestion is as follows.

Compression method parameter
 - none
 - zlib
 - zstd
 - accel (both QEMU sides need to explicitly select the same accelerator
   via the "Compression accelerator parameter").

Compression accelerator parameter
 - auto
 - none
 - qpl (qpl will not support zlib/zstd; it will report an error when
   zlib/zstd is selected)
 - qat (it can provide acceleration of zlib/zstd)
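
The combinations above could be validated along these lines. This is a
sketch of the suggestion only, not code from the series: all names are
hypothetical, and two points are my assumptions rather than something
stated above, namely that the accelerator setting is ignored when the
method is "none" and that "qat" is also acceptable with the "accel"
method.

```c
#include <stdbool.h>

/* Hypothetical names for the two proposed parameters. */
enum mig_method { MIG_METHOD_NONE, MIG_METHOD_ZLIB, MIG_METHOD_ZSTD,
                  MIG_METHOD_ACCEL };
enum mig_accel  { MIG_ACCEL_AUTO, MIG_ACCEL_NONE, MIG_ACCEL_QPL,
                  MIG_ACCEL_QAT };

/* Check one side's method/accelerator pair. */
static bool mig_config_valid(enum mig_method m, enum mig_accel a)
{
    switch (m) {
    case MIG_METHOD_ACCEL:
        /* "accel" needs an explicitly named accelerator. */
        return a == MIG_ACCEL_QPL || a == MIG_ACCEL_QAT;
    case MIG_METHOD_ZLIB:
    case MIG_METHOD_ZSTD:
        /* qpl reports an error when zlib/zstd is selected. */
        return a != MIG_ACCEL_QPL;
    case MIG_METHOD_NONE:
        /* Assumption: no compression, so the accelerator is ignored. */
        return true;
    }
    return false;
}

/* With "accel", source and destination must name the same accelerator. */
static bool mig_sides_match(enum mig_method src_m, enum mig_accel src_a,
                            enum mig_method dst_m, enum mig_accel dst_a)
{
    if (src_m != dst_m) {
        return false;
    }
    return src_m != MIG_METHOD_ACCEL || src_a == dst_a;
}
```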

> > Test condition:
> >   1. Host CPUs are based on Sapphire Rapids, and frequency locked to
> 3.4G
> >   2. VM type, 16 vCPU and 64G memory
> >   3. The Idle workload means no workload is running in the VM
> >   4. The Redis workload means YCSB workloadb + Redis Server are running
> >      in the VM, about 20G or more memory will be used.
> >   5. Source side migartion configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter downtime-limit 300
> >      d. migrate_set_parameter multifd-compression zlib
> >      e. migrate_set_parameter multifd-compression-accel none/qpl
> >      f. migrate_set_parameter max-bandwidth 100G
> >   6. Desitination side migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter multifd-compression zlib
> >      d. migrate_set_parameter multifd-compression-accel none/qpl
> >      e. migrate_set_parameter max-bandwidth 100G
> 
> How is zlib-level setup?  Default (1)?
Yes, level 1, the default level, is used.

> Btw, it seems both zlib/zstd levels are not even working right now to be
> configured.. probably overlooked in migrate_params_apply().
Ok, I will check this.

> > Early migration result, each result is the average of three tests
> > +--------+-------------+--------+--------+---------+----+-----+
> >  |        | The number  |total   |downtime|network  |pages per |
> >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> >  |        | and mode    |        |        |(mbps)   |          |
> >  |        +-------------+-----------------+---------+----------+
> >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> >  |        +-------------+--------+--------+---------+----------+
> >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> >  |workload+-------------+--------+--------+---------+----------+
> >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> 
> The number is slightly confusing to me.  If IAA can send 3x times more
> pages per-second, shouldn't the total migration time 1/3 of the other if
> the guest is idle?  But the total times seem to be pretty close no matter
> N of channels. Maybe I missed something?

This data is read from "info migrate" after the live migration status
changes to "complete".
I think it is the maximum throughput observed while the expected downtime
and the available network bandwidth are met.
When the vCPUs are idle, live migration does not run at maximum
throughput for very long.

> >  +--------+-------------+--------+--------+---------+----------+
> >
> >  +--------+-------------+--------+--------+---------+----+-----+
> >  |        | The number  |total   |downtime|network  |pages per |
> >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> >  |        | and mode    |        |        |(mbps)   |          |
> >  |        +-------------+-----------------+---------+----------+
> >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> >  |        +-------------+--------+--------+---------+----------+
> >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> >  |workload+-------------+--------+--------+---------+----------+
> >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> >  +--------+-------------+--------+--------+---------+----------+
> 
> The redis results look much more preferred on using IAA comparing to the
> idle tests.  Does it mean that IAA works less good with zero pages in
> general (assuming that'll be the majority in idle test)?
Neither the Idle nor the Redis data shows the best performance of IAA,
since the implementation is based on streaming compression of multifd
packets.
In the idle case, most pages are indeed zero pages. Compressing zero
pages is not as efficient as simply detecting them, so the compression
advantage does not show.
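
For comparison, detecting a zero page only needs a scan of the page, as
in the minimal sketch below. `page_is_zero` is a hypothetical name used
for illustration; QEMU has its own optimized helper for this, which is
not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if every byte of the page is zero. */
static bool page_is_zero(const uint8_t *page, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (page[i] != 0) {
            return false;
        }
    }
    return true;
}
```

A page that passes this check can be described to the destination without
sending or compressing its contents at all.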

> From the manual, I see that IAA also supports encryption/decryption.
> Would it be able to accelerate TLS?
On Sapphire Rapids (SPR) and Emerald Rapids (EMR) Xeon servers, IAA does
not support encryption/decryption. This feature may become available in
future generations.
For TLS acceleration, QAT supports this function on SPR/EMR and has been
used successfully in some scenarios.
https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html

> How should one consider IAA over QAT?  What is the major difference?  I
> see that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW
> is something attached to the pcie bus (assume QAT the same)?

Regarding the difference between using IAA and QAT for compression:
1. IAA is more suitable for 4K compression, and QAT is suitable for
   large-block data compression. This is determined by the deflate window
   size. QAT also supports more compression levels, while the IAA
   hardware supports one compression level.
2. In terms of throughput, one IAA device provides 4 GB/s of compression
   throughput and 30 GB/s of decompression throughput. One QAT device
   provides 20 GB/s of compression or decompression throughput.
3. Depending on the product type selected by the customer and on the
   deployment, the resources available for live migration will also
   differ.

Regarding the IOMMU scalable mode:
1. The current IAA software stack requires Shared Virtual Memory (SVM)
   technology, and SVM depends on IOMMU scalable mode.
2. Both IAA and QAT support the PCIe PASID capability, which allows IAA
   to support shared work queues.
https://docs.kernel.org/next/x86/sva.html

> Thanks,
> 
> --
> Peter Xu

Peter Xu Jan. 30, 2024, 10:32 a.m. UTC | #3
On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Monday, January 29, 2024 6:43 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > Compression
> > 
> > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > Hi,
> > 
> > Hi, Yuan,
> > 
> > I have a few comments and questions.  Many of them can be pure questions
> > as I don't know enough on these new technologies.
> > 
> > >
> > > I am writing to submit a code change aimed at enhancing live migration
> > > acceleration by leveraging the compression capability of the Intel
> > > In-Memory Analytics Accelerator (IAA).
> > >
> > > The implementation of the IAA (de)compression code is based on Intel
> > > Query Processing Library (QPL), an open-source software project
> > > designed for IAA high-level software programming.
> > > https://github.com/intel/qpl
> > >
> > > In the last version, there was some discussion about whether to
> > > introduce a new compression algorithm for IAA. Because the compression
> > > algorithm of IAA hardware is based on deflate, and QPL already
> > > supports Zlib, so in this version, I implemented IAA as an accelerator
> > > for the Zlib compression method. However, due to some reasons, QPL is
> > > currently not compatible with the existing Zlib method that Zlib
> > > compressed data can be decompressed by QPl and vice versa.
> > >
> > > I have some concerns about the existing Zlib compression
> > >   1. Will you consider supporting one channel to support multi-stream
> > >      compression? Of course, this may lead to a reduction in compression
> > >      ratio, but it will allow the hardware to process each stream
> > >      concurrently. We can have each stream process multiple pages,
> > >      reducing the loss of compression ratio. For example, 128 pages are
> > >      divided into 16 streams for independent compression. I will provide
> > >      the a early performance data in the next version(v4).
> > 
> > I think Juan used to ask similar question: how much this can help if
> > multifd can already achieve some form of concurrency over the pages?
> 
> 
> > Couldn't the user specify more multifd channels if they want to grant more
> > cpu resource for comp/decomp purpose?
> > 
> > IOW, how many concurrent channels QPL can provide?  What is the suggested
> > concurrency channels there?
> 
> From the QPL software, there is no limit on the number of concurrent compression and decompression tasks.
> From the IAA hardware, one IAA physical device can process two compressions concurrently or eight decompression tasks concurrently. There are up to 8 IAA devices on an Intel SPR Server and it will vary according to the customer’s product selection and deployment.
> 
> Regarding the requirement for the number of concurrent channels, I think this may not be a bottleneck problem.
> Please allow me to introduce a little more here
> 
> 1. If the compression design is based on Zlib/Deflate/Gzip streaming mode, then we indeed need more channels to maintain concurrent processing. Because each time a multifd packet is compressed (including 128 independent pages), it needs to be compressed page by page. These 128 pages are not concurrent. The concurrency is reflected in the logic of multiple channels for the multifd packet.

Right.  However since you said there're only a max of 8 IAA devices, would
it also mean n_multifd_threads=8 can be a good enough scenario to achieve
proper concurrency, no matter the size of data chunk for one compression
request?

Maybe you meant each device can still process concurrent compression
requests, so the real capability of concurrency can be much larger than 8?

> 
> 2. Through testing, we prefer concurrent processing on 4K pages, not multifd packet, which means that 128 pages belonging to a packet can be compressed/decompressed concurrently. Even one channel can also utilize all the resources of IAA. But this is not compatible with existing zlib.
> The code is similar to the following
>   for(int i = 0; i < num_pages; i++) {
>     job[i]->input_data = pages[i]
>     submit_job(job[i] //Non-block submit for compression/decompression tasks
>   }
>   for(int i = 0; i < num_pages; i++) {
>     wait_job(job[i])  //busy polling. In the future, we will make this part and data sending into pipeline mode.
>   } 

Right, if more concurrency is wanted, you can use this async model; I think
Juan suggested something like this before, and I agree it will also work.
It can be done on top of the basic functionality once that is merged.

> 
> 3. Currently, the patches we provide to the community are based on streaming compression. This is to be compatible with the current zlib method. However, we found that there are still many problems with this, so in the next version we plan to provide the independent QPL/IAA acceleration function described above.
> Compatibility issues include the following
>     1. QPL currently does not support the z_sync_flush operation
>     2. The IAA comp/decomp window is fixed at 4K. By default, the zlib window size is 32K, and the window size should be the same on both the comp and decomp sides.
>     3. At the same time, I researched the QAT compression scheme. QATzip currently does not support zlib, nor does it support z_sync_flush. Its window size is 32K
> 
> In general, I think it is a good suggestion to make the accelerator compatible with standard compression algorithms, but also let the accelerator run independently, thus avoiding some compatibility and performance problems of the accelerator. For example, we can add the "accel" option to the compression method, and then the user must specify the same accelerator by compression accelerator parameter on the source and remote ends (just like specifying the same compression algorithm)
> 
> > >
> > >   2. Will you consider using QPL/IAA as an independent compression
> > >      algorithm instead of an accelerator? In this way, we can better
> > >      utilize hardware performance and some features, such as IAA's
> > >      canned mode, which can be dynamically generated by some statistics
> > >      of data. A huffman table to improve the compression ratio.
> > 
> > Maybe one more knob will work?  If it's not compatible with the deflate
> > algo maybe it should never be the default.  IOW, the accelerators may be
> > extended into this (based on what you already proposed):
> > 
> >   - auto ("qpl" first, "none" second; never "qpl-optimized")
> >   - none (old zlib)
> >   - qpl (qpl compatible)
> >   - qpl-optimized (qpl incompatible)
> > 
> > Then "auto"/"none"/"qpl" will always be compatible; only the last isn't.
> > The user can select it explicitly, but only on both sides of QEMU.
> Yes, this is what I want: I need a mode in which QPL is not compatible with zlib. From my current point of view, if zlib chooses raw deflate mode, then QAT will be compatible with the current community's zlib solution.
> So my suggestion is as follows
> 
> Compression method parameter
>  - none
>  - zlib
>  - zstd
>  - accel (Both Qemu sides need to select the same accelerator from "Compression accelerator parameter" explicitly).

Can we avoid naming it as "accel"?  It's too generic, IMHO.

If it's a special algorithm that only applies to QPL, can we just call it
"qpl" here?  Then...

> 
> Compression accelerator parameter
>  - auto
>  - none
>  - qpl (qpl will not support zlib/zstd; it will report an error when zlib/zstd is selected)
>  - qat (it can provide acceleration of zlib/zstd)

Here IMHO we don't need qpl then, because the "qpl" compression method can
enforce a hardware accelerator.  In summary, not sure whether this works:

Compression methods: none, zlib, zstd, qpl (describes all the algorithms
that might be used; again, qpl enforces HW support).

Compression accelerators: auto, none, qat (only applies when zlib/zstd
chosen above)

> 
> > > Test condition:
> > >   1. Host CPUs are based on Sapphire Rapids, and frequency locked to 3.4G
> > >   2. VM type, 16 vCPU and 64G memory
> > >   3. The Idle workload means no workload is running in the VM
> > >   4. The Redis workload means YCSB workloadb + Redis Server are running
> > >      in the VM, about 20G or more memory will be used.
> > >   5. Source side migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter downtime-limit 300
> > >      d. migrate_set_parameter multifd-compression zlib
> > >      e. migrate_set_parameter multifd-compression-accel none/qpl
> > >      f. migrate_set_parameter max-bandwidth 100G
> > >   6. Destination side migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter multifd-compression zlib
> > >      d. migrate_set_parameter multifd-compression-accel none/qpl
> > >      e. migrate_set_parameter max-bandwidth 100G
> > 
> > How is zlib-level setup?  Default (1)?
> Yes, it uses level 1, the default level.
> 
> > Btw, it seems both zlib/zstd levels are not even working right now to be
> > configured.. probably overlooked in migrate_params_apply().
> Ok, I will check this.

Thanks.  If you plan to post a patch, please attach:

Reported-by: Xiaohui Li <xiaohli@redhat.com>

As that's reported by our QE team.

Maybe you can already add a unit test (migration-test.c, under tests/)
that exposes this issue, by setting z*-level to a non-default value (not 1),
then querying it back and asserting that the value did change.

> 
> > > Early migration result, each result is the average of three tests
> > > +--------+-------------+--------+--------+---------+----+-----+
> > >  |        | The number  |total   |downtime|network  |pages per |
> > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > >  |        | and mode    |        |        |(mbps)   |          |
> > >  |        +-------------+-----------------+---------+----------+
> > >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > >  |workload+-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > 
> > The numbers are slightly confusing to me.  If IAA can send 3x more pages
> > per second, shouldn't the total migration time be 1/3 of the other if
> > the guest is idle?  But the total times seem to be pretty close no matter
> > the number of channels. Maybe I missed something?
> 
> This data is read from "info migrate" after the live migration status changes to "complete".
> I think it is the max throughput when the expected downtime and available network bandwidth are met.
> When the vCPUs are idle, live migration does not run at maximum throughput for very long.
> 
> > >  +--------+-------------+--------+--------+---------+----------+
> > >
> > >  +--------+-------------+--------+--------+---------+----+-----+
> > >  |        | The number  |total   |downtime|network  |pages per |
> > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > >  |        | and mode    |        |        |(mbps)   |          |
> > >  |        +-------------+-----------------+---------+----------+
> > >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > >  |workload+-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > >  +--------+-------------+--------+--------+---------+----------+
> > 
> > The redis results favor IAA much more than the idle tests do.  Does it
> > mean that IAA works less well with zero pages in general (assuming
> > that'll be the majority in the idle test)?
> Neither the Idle nor the Redis data shows the best performance for IAA, since it is based on multifd packet streaming compression.
> In the idle case, most pages are indeed zero pages, and compressing a zero page is not as efficient as simply detecting it, so the compression advantage is not reflected.
> 
> > From the manual, I see that IAA also supports encryption/decryption.
> > Would it be able to accelerate TLS?
> On Sapphire Rapids (SPR)/Emerald Rapids (EMR) Xeon servers, IAA does not support encryption/decryption. This feature may be available in future generations.
> For TLS acceleration, QAT supports this function on SPR/EMR and has been used successfully in some scenarios.
> https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html
> 
> > How should one consider IAA over QAT?  What is the major difference?  I
> > see that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW
> > is something attached to the pcie bus (assume QAT the same)?
> 
> Regarding the difference between using IAA or QAT for compression
> 1. IAA is more suitable for 4K compression, and QAT is suitable for large block data compression. This is determined by the deflate window size. QAT can also support more compression levels, while IAA hardware supports one compression level.
> 2. From the perspective of throughput, one IAA device provides 4GBps of compression throughput and 30GBps of decompression throughput. One QAT device provides 20GBps of compression or decompression throughput.
> 3. Depending on the product type selected by the customer and the deployment, the resources used for live migration will also be different.
> 
> Regarding the IOMMU scalable mode
> 1. The current IAA software stack requires Shared Virtual Memory (SVM) technology, and SVM depends on IOMMU scalable mode.
> 2. Both IAA and QAT support the PCIe PASID capability, so IAA can support shared work queues.
> https://docs.kernel.org/next/x86/sva.html

Thanks for all this information.  I'm personally still curious why Intel
would like to provide two new technologies serving similar purposes at
nearly the same time.

Could you put much of this information into a doc file?  It can be
docs/devel/migration/QPL.rst.

Also, we may want a unit test to cover the new stuff when the whole design
settles. It may cover all the supported modes, but for sure we can skip the
hw accelerated use case.
Yuan Liu Jan. 31, 2024, 2:08 a.m. UTC | #4
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Tuesday, January 30, 2024 6:32 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
> 
> On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Monday, January 29, 2024 6:43 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > > Compression
> > >
> > > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > > Hi,
> > >
> > > Hi, Yuan,
> > >
> > > I have a few comments and questions.  Many of them can be pure
> > > questions as I don't know enough on these new technologies.
> > >
> > > >
> > > > I am writing to submit a code change aimed at enhancing live
> > > > migration acceleration by leveraging the compression capability of
> > > > the Intel In-Memory Analytics Accelerator (IAA).
> > > >
> > > > The implementation of the IAA (de)compression code is based on
> > > > Intel Query Processing Library (QPL), an open-source software
> > > > project designed for IAA high-level software programming.
> > > > https://github.com/intel/qpl
> > > >
> > > > In the last version, there was some discussion about whether to
> > > > introduce a new compression algorithm for IAA. Because the
> > > > compression algorithm of IAA hardware is based on deflate, and QPL
> > > > already supports Zlib, so in this version, I implemented IAA as an
> > > > accelerator for the Zlib compression method. However, due to some
> > > > reasons, QPL is currently not compatible with the existing Zlib
> > > > method that Zlib compressed data can be decompressed by QPl and vice
> > > > versa.
> > > >
> > > > I have some concerns about the existing Zlib compression
> > > >   1. Will you consider supporting one channel to support
> > > >      multi-stream compression? Of course, this may lead to a
> > > >      reduction in compression ratio, but it will allow the hardware
> > > >      to process each stream concurrently. We can have each stream
> > > >      process multiple pages, reducing the loss of compression ratio.
> > > >      For example, 128 pages are divided into 16 streams for
> > > >      independent compression. I will provide early performance
> > > >      data in the next version (v4).
> > >
> > > I think Juan used to ask similar question: how much this can help if
> > > multifd can already achieve some form of concurrency over the pages?
> >
> >
> > > Couldn't the user specify more multifd channels if they want to
> > > grant more cpu resource for comp/decomp purpose?
> > >
> > > IOW, how many concurrent channels QPL can provide?  What is the
> > > suggested concurrency channels there?
> >
> > From the QPL software, there is no limit on the number of concurrent
> > compression and decompression tasks.
> > From the IAA hardware, one IAA physical device can process two
> > compressions concurrently or eight decompression tasks concurrently.
> > There are up to 8 IAA devices on an Intel SPR Server and it will vary
> > according to the customer’s product selection and deployment.
> >
> > Regarding the requirement for the number of concurrent channels, I think
> > this may not be a bottleneck problem.
> > Please allow me to introduce a little more here
> >
> > 1. If the compression design is based on Zlib/Deflate/Gzip streaming
> > mode, then we indeed need more channels to maintain concurrent
> > processing. Because each time a multifd packet is compressed (including
> > 128 independent pages), it needs to be compressed page by page. These
> > 128 pages are not concurrent. The concurrency is reflected in the logic
> > of multiple channels for the multifd packet.
> 
> Right.  However since you said there're only a max of 8 IAA devices, would
> it also mean n_multifd_threads=8 can be a good enough scenario to achieve
> proper concurrency, no matter the size of data chunk for one compression
> request?
> 
> Maybe you meant each device can still process concurrent compression
> requests, so the real capability of concurrency can be much larger than 8?

Yes, the number of concurrent requests can be greater than 8: one device can
handle 2 compression requests or 8 decompression requests concurrently.

> >
> > 2. Through testing, we prefer concurrent processing on 4K pages, not
> > multifd packet, which means that 128 pages belonging to a packet can be
> > compressed/decompressed concurrently. Even one channel can also utilize
> > all the resources of IAA. But this is not compatible with existing zlib.
> > The code is similar to the following
> >   for (int i = 0; i < num_pages; i++) {
> >     job[i]->input_data = pages[i];
> >     submit_job(job[i]);  // non-blocking submit of a compression/decompression task
> >   }
> >   for (int i = 0; i < num_pages; i++) {
> >     wait_job(job[i]);  // busy polling; in the future this part and data sending will be pipelined
> >   }
> 
> Right, if more concurrency is wanted, you can use this async model; I
> think Juan used to suggest such and I agree it will also work.  It can be
> done on top of the basic functionality merged.

Sure, I think we can show better performance based on it.

> > 3. Currently, the patches we provide to the community are based on
> > streaming compression. This is to be compatible with the current zlib
> > method. However, we found that there are still many problems with this,
> > so in the next version we plan to provide the independent QPL/IAA
> > acceleration function described above.
> > Compatibility issues include the following
> >     1. QPL currently does not support the z_sync_flush operation
> >     2. The IAA comp/decomp window is fixed at 4K. By default, the zlib
> >        window size is 32K, and the window size should be the same on
> >        both the comp and decomp sides.
> >     3. At the same time, I researched the QAT compression scheme.
> >        QATzip currently does not support zlib, nor does it support
> >        z_sync_flush. Its window size is 32K
> >
> > In general, I think it is a good suggestion to make the accelerator
> > compatible with standard compression algorithms, but also let the
> > accelerator run independently, thus avoiding some compatibility and
> > performance problems of the accelerator. For example, we can add the
> > "accel" option to the compression method, and then the user must
> > specify the same accelerator by compression accelerator parameter on
> > the source and remote ends (just like specifying the same compression
> > algorithm)
> >
> > > >
> > > >   2. Will you consider using QPL/IAA as an independent compression
> > > >      algorithm instead of an accelerator? In this way, we can better
> > > >      utilize hardware performance and some features, such as IAA's
> > > >      canned mode, which can be dynamically generated by some
> > > >      statistics of data. A huffman table to improve the compression
> > > >      ratio.
> > >
> > > Maybe one more knob will work?  If it's not compatible with the
> > > deflate algo maybe it should never be the default.  IOW, the
> > > accelerators may be extended into this (based on what you already
> > > proposed):
> > >
> > >   - auto ("qpl" first, "none" second; never "qpl-optimized")
> > >   - none (old zlib)
> > >   - qpl (qpl compatible)
> > >   - qpl-optimized (qpl incompatible)
> > >
> > > Then "auto"/"none"/"qpl" will always be compatible; only the last
> > > isn't. The user can select it explicitly, but only on both sides of QEMU.
> > Yes, this is what I want: I need a mode in which QPL is not compatible
> > with zlib. From my current point of view, if zlib chooses raw deflate
> > mode, then QAT will be compatible with the current community's zlib
> > solution.
> > So my suggestion is as follows
> >
> > Compression method parameter
> >  - none
> >  - zlib
> >  - zstd
> >  - accel (Both Qemu sides need to select the same accelerator from
> >    "Compression accelerator parameter" explicitly).
> 
> Can we avoid naming it as "accel"?  It's too generic, IMHO.
> 
> If it's a special algorithm that only applies to QPL, can we just call it
> "qpl" here?  Then...

Yes, I agree.

> > Compression accelerator parameter
> >  - auto
> >  - none
> >  - qpl (qpl will not support zlib/zstd; it will report an error when
> > zlib/zstd is selected)
> >  - qat (it can provide acceleration of zlib/zstd)
> 
> Here IMHO we don't need qpl then, because the "qpl" compression method can
> enforce a hardware accelerator.  In summary, not sure whether this works:
> 
> Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> that might be used; again, qpl enforces HW support).
> 
> Compression accelerators: auto, none, qat (only applies when zlib/zstd
> chosen above)

I agree. QPL will dynamically detect IAA hardware resources and prioritize
hardware acceleration. If IAA is not available, QPL can also provide an
efficient deflate-based compression algorithm in software. The software and
hardware paths are fully compatible.

> > > > Test condition:
> > > >   1. Host CPUs are based on Sapphire Rapids, and frequency locked
> > > >      to 3.4G
> > > >   2. VM type, 16 vCPU and 64G memory
> > > >   3. The Idle workload means no workload is running in the VM
> > > >   4. The Redis workload means YCSB workloadb + Redis Server are
> > > >      running in the VM, about 20G or more memory will be used.
> > > >   5. Source side migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      d. migrate_set_parameter multifd-compression zlib
> > > >      e. migrate_set_parameter multifd-compression-accel none/qpl
> > > >      f. migrate_set_parameter max-bandwidth 100G
> > > >   6. Destination side migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression zlib
> > > >      d. migrate_set_parameter multifd-compression-accel none/qpl
> > > >      e. migrate_set_parameter max-bandwidth 100G
> > >
> > > How is zlib-level setup?  Default (1)?
> > Yes, it uses level 1, the default level.
> >
> > > Btw, it seems both zlib/zstd levels are not even working right now
> > > to be configured.. probably overlooked in migrate_params_apply().
> > Ok, I will check this.
> 
> Thanks.  If you plan to post a patch, please attach:
> 
> Reported-by: Xiaohui Li <xiaohli@redhat.com>
> 
> As that's reported by our QE team.
> 
> Maybe you can already add a unit test (migration-test.c, under tests/)
> that exposes this issue, by setting z*-level to a non-default value (not 1),
> then querying it back and asserting that the value did change.

Thanks for your suggestions. I will improve the test part of the code.

> > > > Early migration result, each result is the average of three tests
> > > > +--------+-------------+--------+--------+---------+----+-----+
> > > >  |        | The number  |total   |downtime|network  |pages per |
> > > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > > >  |        | and mode    |        |        |(mbps)   |          |
> > > >  |        +-------------+-----------------+---------+----------+
> > > >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > > >  |workload+-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > >
> > > The numbers are slightly confusing to me.  If IAA can send 3x more
> > > pages per second, shouldn't the total migration time be 1/3 of the
> > > other if the guest is idle?  But the total times seem to be pretty
> > > close no matter the number of channels. Maybe I missed something?
> >
> > This data is read from "info migrate" after the live migration status
> > changes to "complete".
> > I think it is the max throughput when the expected downtime and
> > available network bandwidth are met.
> > When the vCPUs are idle, live migration does not run at maximum
> > throughput for very long.
> >
> > > >  +--------+-------------+--------+--------+---------+----------+
> > > >
> > > >  +--------+-------------+--------+--------+---------+----+-----+
> > > >  |        | The number  |total   |downtime|network  |pages per |
> > > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > > >  |        | and mode    |        |        |(mbps)   |          |
> > > >  |        +-------------+-----------------+---------+----------+
> > > >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > > >  |workload+-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > > >  +--------+-------------+--------+--------+---------+----------+
> > >
> > > The redis results favor IAA much more than the idle tests do.  Does
> > > it mean that IAA works less well with zero pages in general (assuming
> > > that'll be the majority in the idle test)?
> > Neither the Idle nor the Redis data shows the best performance for IAA,
> > since it is based on multifd packet streaming compression.
> > In the idle case, most pages are indeed zero pages, and compressing a
> > zero page is not as efficient as simply detecting it, so the
> > compression advantage is not reflected.
> >
> > > From the manual, I see that IAA also supports encryption/decryption.
> > > Would it be able to accelerate TLS?
> > On Sapphire Rapids (SPR)/Emerald Rapids (EMR) Xeon servers, IAA does not
> > support encryption/decryption. This feature may be available in future
> > generations.
> > For TLS acceleration, QAT supports this function on SPR/EMR and has been
> > used successfully in some scenarios.
> > https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html
> >
> > > How should one consider IAA over QAT?  What is the major difference?
> > > I see that IAA requires IOMMU scalable mode, why?  Is it because the
> > > IAA HW is something attached to the pcie bus (assume QAT the same)?
> >
> > Regarding the difference between using IAA or QAT for compression
> > 1. IAA is more suitable for 4K compression, and QAT is suitable for
> > large block data compression. This is determined by the deflate window
> > size. QAT can also support more compression levels, while IAA hardware
> > supports one compression level.
> > 2. From the perspective of throughput, one IAA device provides 4GBps of
> > compression throughput and 30GBps of decompression throughput. One QAT
> > device provides 20GBps of compression or decompression throughput.
> > 3. Depending on the product type selected by the customer and the
> > deployment, the resources used for live migration will also be different.
> >
> > Regarding the IOMMU scalable mode
> > 1. The current IAA software stack requires Shared Virtual Memory (SVM)
> > technology, and SVM depends on IOMMU scalable mode.
> > 2. Both IAA and QAT support the PCIe PASID capability, so IAA can support
> > shared work queues.
> > https://docs.kernel.org/next/x86/sva.html
> 
> Thanks for all this information.  I'm personally still curious why Intel
> would like to provide two new technologies serving similar purposes at
> nearly the same time.
> 
> Could you put much of this information into a doc file?  It can be
> docs/devel/migration/QPL.rst.

Sure, I will update the documentation.

> Also, we may want a unit test to cover the new stuff when the whole
> design settles. It may cover all the supported modes, but for sure we can
> skip the hw accelerated use case.

For QPL, I think this is not a problem. QPL is added as a new compression
method that can also be used when hardware accelerators are not available.